Interview

In this new episode of our podcast, we focus on statistical open data, one of the categories of datasets that the European Union considers to be of high value. Today we are going to talk about how this type of data, produced by public administrations, can become a key tool to better understand reality, make decisions and create new services. We have two guests for this.

  • María Santana Álvarez, deputy director general of dissemination and communication at the National Institute of Statistics (INE).
  • Alberto González Yanes, deputy director of Statistics and Data Analysis at the Canary Islands Institute of Statistics (ISTAC).

Listen to the full podcast (only available in Spanish)

Summary / Transcript of the interview

1. Why is statistical data considered high-value data? What is its potential?

María Santana Álvarez: In this society in which we live, where data surrounds us and information flows so quickly, it is important that official statistics are known and recognized as high-quality and reliable data, and this is achieved by making them accessible to all of society in an open way. This information is useful for informed decision-making and, therefore, statistical data already has a lot of value, but its reuse increases that value and has a great impact on society. 

In relation to the data produced by the INE, the statistical operations for which we are responsible cover topics as varied as demography, the economy, the labour market, the environment, the service sector, science and technology, and living conditions, among many others. To give you some specific examples of statistical operations: the Turnover Index, the Statistics on R&D Activities, the Monthly Birth Estimate or the Time Use Survey, in addition to better-known ones such as the Consumer Price Index, the Labour Force Survey or the Quarterly National Accounts. As you can see, official statistical data is of great value and its reuse is essential.

The definition of high-value datasets has reinforced this. These are datasets with great potential to benefit society, the environment and the economy and, in fact, one of the categories established in the Regulation is statistics, which includes datasets related to national accounts, demography or inequality (the topics I mentioned above). In this category, most of the datasets are produced by the INE.

Alberto González Yanes: In this century, or this beginning of the new century in which we are living, so saturated with information and data, it is important to bear in mind the importance of statistics in themselves within a democratic society and advanced democratic states. Statistics, as objective and transparent data, need to be available in open formats, not only for the economy, so that new services can be built, but also to keep strengthening data-based decision-making, not only by public administrations but also by companies and citizens.

One important thing must be taken into account: official data, whether published by the INE or by regional institutes such as ISTAC, generates rights and duties. I always give the example of how official data such as the CPI, or the official population figures themselves, generate rights and duties for municipalities, local entities, councils, governments, etc.

This magnitude, the importance of statistical data as a fundamental pillar of democratic states, which is recognised by the United Nations, means that not only the catalogue of open datasets defined by the European Commission's Implementing Regulation should be considered of high value, but also all the data produced by official statistics, because it is fundamental for democratic states.

2. Can you explain a little more about the role of ISTAC and the INE in the statistical open data ecosystem? What services based on open data do you offer to citizens?

Alberto González Yanes: The regional and state statistical systems are two legs that are coordinated. There is strong coordination within the system, through the CITE (Interterritorial Statistics Committee). What the autonomous communities do is either reuse the INE's own information, or expand on information that is not developed at the national level and that is necessary for regional purposes. We, for example, are one of the major international benchmarks in the production of tourism statistics, to the point that we appear among the World Tourism Organization's best practices. We offer information on tourism at the municipal level that some states do not even have at the national level. The information we have is reused by the tourist information systems of all public administrations, and also by hotel employers' associations. That includes the Tourist Accommodation Statistics, the Survey on Tourism Expenditure, the Statistics on Tourist Movements at Canary Islands Borders (which we also developed in collaboration with the National Institute of Statistics, expanding the sample for the Canary Islands) and the Tourist Housing Occupancy Survey. These are the star products in an autonomous community where almost 35% of GDP is linked to tourism.

María Santana Álvarez: In the case of the INE, all our production is offered openly through the website, which is the main meeting point with our users. Proof of this is that last year, in 2025, it received more than 42 million visits. All the data we produce is disseminated according to the publication schedule of statistical operations, free of charge and under an open license. 

I like to explain this topic in a pedagogical way, taking Tim Berners-Lee's five stars as a reference and drawing an analogy with the INE's dissemination system and how we are climbing that ladder. The current INE dissemination system is the result of many years of evolution, and throughout that evolution we have opted for developing tools that make reuse effective.

Starting with Tim Berners-Lee's stars, one star means that you produce the data and disseminate it openly under a license that allows reuse, but that is not enough for reusers to make use of it effectively and easily. Two stars would be to offer the aggregated data we produce in proprietary formats such as Excel and PC-Axis. Three stars would be CSV, flat formats. And we come to the fourth star, which is to make information accessible through URIs. URLs are URIs and, in the case of the INE, we have a JSON API for all the aggregated data we produce.

In relation to this, I do want to comment on the advantages of having a JSON API. In our case, it provides access to the metadata and aggregated data that we produce. This allows automatic and direct use of all the information we produce. The data is updated according to the calendar: regardless of when a user accesses the web service, they will find the latest data available. Users of this system can customize their queries and filter by the metadata that defines tables and series.
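As a rough illustration of what this kind of access can look like from the reuser's side, the following Python sketch builds a request URL and parses a response. The base URL, the `DATOS_SERIE` function name, the `nult` parameter, the series code and the response field names (`Nombre`, `Data`, `Fecha`, `Valor`) are assumptions made for the example and should be checked against the INE's API documentation. The sample payload is invented so that the snippet runs without network access.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed base URL and path layout; verify against the official API documentation.
BASE_URL = "https://servicios.ine.es/wstempus/js"


def build_series_url(series_code, last_n=None, lang="EN"):
    """Build a URL asking for a series, optionally limited to the last N periods."""
    url = f"{BASE_URL}/{lang}/DATOS_SERIE/{series_code}"
    if last_n is not None:
        url += "?" + urlencode({"nult": last_n})
    return url


def parse_series(payload):
    """Turn the (assumed) JSON structure into a name and (ISO date, value) pairs."""
    doc = json.loads(payload)
    points = []
    for obs in doc["Data"]:
        # Timestamps are assumed to be milliseconds since the Unix epoch.
        day = datetime.fromtimestamp(obs["Fecha"] / 1000, tz=timezone.utc).date()
        points.append((day.isoformat(), obs["Valor"]))
    return doc["Nombre"], points


# Invented payload for a hypothetical series, used instead of a live request.
SAMPLE = json.dumps({
    "Nombre": "Hypothetical monthly index",
    "Data": [
        {"Fecha": 1704067200000, "Valor": 101.2},
        {"Fecha": 1706745600000, "Valor": 101.9},
    ],
})

name, points = parse_series(SAMPLE)
print(build_series_url("HYPOTHETICAL_CODE", last_n=2))
print(name, points)
```

Because the query is just a URL, a script like this can be rerun on each date in the publication calendar and will pick up whatever the service currently returns.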

Nor have we forgotten the large community of R users in data science. That's why we've produced a package called INEapir, which incorporates all the functionalities of the JSON API and makes it easier for these reusers to work with our data in an environment that they already know, with systems and data structures to which they are accustomed.

In addition, all the documentation related to the API will soon be available not only in the current format on the website, but also in OpenAPI with Swagger. This will allow access to our API information in a more interactive and intuitive way for all those users who are used to working with APIs in general.

Alberto González Yanes: It is important to note, first of all, that all statistical data is public by nature, because state statistical regulations (Law 12/1989) or regional regulations require it to be. In our case, we have different initiatives that allow reuse, starting with an ecosystem of about 10 or 15 APIs supported by international standards such as SDMX (Statistical Data and Metadata Exchange), which lets you retrieve all the information we produce, including the entire open data catalog: management APIs, all the cartography... We have everything in that API ecosystem, to which we obviously add connectors, be it for Python or R, with different libraries, or specific connectors for some market solutions, to facilitate reuse by third parties in dashboards.

For us it is also important, apart from opening the data, to open all our semantic assets. We manage concepts, classifications, record layouts... For us, the reuse of all these classifications and concepts is important, apart from that of the statistical data. One of the main reusers of this entire system is the Government of the Canary Islands itself, which incorporates all our standardised classifications from the ground up, starting with the electronic forms of the e-administration (something that is sometimes little known). They do this through the service APIs that we offer.

Therefore, we have different proposals, not only for access to data, but also for data processing and normalization. 

3. How do you work to ensure interoperability between your statistical systems, and also with international organizations, such as Eurostat?

María Santana Álvarez: Earlier I used Tim Berners-Lee's system to describe the level of openness of the INE's dissemination system. I stopped at the fourth star, but that system has five stars, and it is precisely the fifth star that guarantees interoperability. From the point of view of dissemination, data that are subject to a national or international classification, such as the National Classification of Economic Activities, of Education or of Occupations, or to other standards approved by the INE, such as the codes of the autonomous communities, provinces and municipalities, will always be accompanied by this metadata. Therefore, the data produced by other actors in the national statistical system that use these same classifications, codes, etc., will be interoperable with each other. That is from the point of view of dissemination, but it also holds from the point of view of production, because in this national statistical system of which the INE is part, we all have to transmit to Eurostat the data we collect and disseminate, the aggregated data. This way of establishing interoperability begins long before dissemination: when new statistical operations are established or grouped together, directives and regulations are drawn up that set out the methodologies and concepts that all Member States have to use. This ensures that when we transmit the microdata or the aggregated results to Eurostat, it is already known that we have taken those same concepts and standards as a basis.
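To make the point about shared codes concrete, here is a minimal Python sketch of why two datasets from different producers can be combined directly when both use the same standard code for each territory. The producers, the five-digit codes and all the figures are invented for the illustration, and `join_on_code` is a hypothetical helper, not part of any INE or ISTAC tool.

```python
def join_on_code(left, right):
    """Combine two datasets keyed by the same standard code (inner join)."""
    shared = left.keys() & right.keys()
    return {code: {**left[code], **right[code]} for code in sorted(shared)}


# Invented figures from two hypothetical producers using the same 5-digit codes.
population = {
    "00001": {"population": 12000},
    "00002": {"population": 3400},
}
tourism = {
    "00001": {"overnight_stays": 85000},
    "00003": {"overnight_stays": 1200},
}

combined = join_on_code(population, tourism)
print(combined)
```

No name matching or manual mapping is needed: the join works only because both sides agreed on the code list beforehand.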

As for the transmission itself, to make it even more standard, SDMX is used, with data structure definitions (DSDs) based on standard data structures and code lists, to ensure comparability and consistency in official European statistics.
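The idea behind a data structure definition can be sketched in a few lines of Python. This is a conceptual illustration only, not the SDMX information model or any SDMX library: the `DataStructure` class, the dimension names and the code lists below are hypothetical. It shows how agreeing on dimensions and code lists lets a receiver reject an observation that uses an unknown code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """Hypothetical, simplified stand-in for a data structure definition."""
    dimensions: dict  # dimension name -> set of allowed codes (the code list)

    def validate(self, observation):
        """Return a list of problems found in one observation's dimension values."""
        problems = []
        for dim, codes in self.dimensions.items():
            if dim not in observation:
                problems.append(f"missing dimension {dim}")
            elif observation[dim] not in codes:
                problems.append(f"unknown code {observation[dim]!r} for {dim}")
        return problems


# Illustrative code lists; real ones are maintained by the statistical organisations.
dsd = DataStructure(dimensions={
    "FREQ": {"A", "Q", "M"},
    "REF_AREA": {"ES", "FR", "PT"},
    "SEX": {"F", "M", "T"},
})

good = {"FREQ": "A", "REF_AREA": "ES", "SEX": "T", "value": 48.6}
bad = {"FREQ": "A", "REF_AREA": "Spain", "value": 48.6}

print(dsd.validate(good))
print(dsd.validate(bad))
```

The first observation passes with no problems reported, while the second is flagged both for the free-text area name and for the missing dimension.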

Alberto González Yanes: As María has said, interoperability is a key and fundamental issue within public statistics. She spoke of standardization through SDMX, which is fundamental and has even been a reference for the W3C when drawing up interoperability standards and ontologies. She spoke of the creation of codes and classifications that are usable not only among us, but also by the rest of the public sector. And I link that closely with the competence that public statistics has in terms of semantic standardization, according to article 10.3 of the National Interoperability Scheme.

In this sense, as we take this seriously, the Interterritorial Statistics Committee proposed the creation of a statistical interoperability node at the national level, which would facilitate not only the exchange of information between the different statistical bodies of the Spanish State, but also the transmission of administrative data for statistical purposes from the public administrations to the statistical system. It is a benchmark project at European level. It was funded by the European Commission, and we hope that during 2026 we will begin to deploy the different actions for the development of the node as a reference at European level.

4. What are the main current challenges in opening statistical data?

María Santana Álvarez: As I mentioned earlier, all our production of aggregate data from statistical operations, and also certain anonymized microdata, are published openly. However, the INE has much more information to offer which, given its nature, cannot be offered openly. I am referring to sensitive microdata.

Let's look briefly at the legal basis, because this is a very sensitive issue. In 2022 there was an amendment to the Public Statistical Function Law, through which statistical services can grant research entities access to confidential data. These data do not allow the direct identification of the units and can only be used to carry out scientific studies of public interest, and certain requirements must be met to be able to access them. In fact, the statistical services evaluate whether it is possible to provide this information, that is, we are very rigorous in giving access to this data. To give you an idea, the INE managed more than 80 requests for this type of access to confidential microdata last year and a high percentage of these were considered viable.

In addition, the INE is the coordinator of a project called ES_DataLab, arising from an agreement signed with the Tax Agency, Social Security, the Bank of Spain and the Public Employment Service. All these organizations are large producers of official statistics, but also holders of a large volume of administrative records. ES_DataLab offers researchers access to sensitive microdata sets resulting from the combination of different databases from at least two of the agencies that have signed this agreement, but this cannot be offered openly for reasons of confidentiality and statistical secrecy.

What challenge is on the horizon to be able to provide this type of data, that is, microdata at the level of the reporting unit, in an open way without posing a problem of confidentiality or statistical secrecy? The solution would be synthetic populations. In fact, at the INE we are working on the construction of these synthetic populations: populations that reproduce the statistical characteristics of the real population, but whose records do not correspond to a real reporting unit. They are fictitious but, when statistical analyses are run, they show the same characteristics as real populations. This would be a way to openly publish microdata at this level of detail, without having to go through the evaluation committees that we have right now and the restrictions imposed by current legislation.
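The general idea of a synthetic population can be illustrated with a deliberately simple Python sketch. It is not the method the INE is developing, which the interview does not describe: here synthetic records are simply drawn from the empirical joint distribution of two categorical variables, and the "real" microdata, the variable names and the `synthesize` helper are all invented for the example. A production method would also need to deal with disclosure risk for rare combinations, which this sketch ignores.

```python
import random
from collections import Counter


def synthesize(records, n, seed=0):
    """Draw n synthetic records from the empirical joint distribution of the input."""
    counts = Counter(records)            # frequency of each attribute combination
    combos = list(counts)
    weights = [counts[c] for c in combos]
    rng = random.Random(seed)            # fixed seed so the example is reproducible
    return rng.choices(combos, weights=weights, k=n)


def shares(records, position):
    """Share of each category for the attribute at the given tuple position."""
    totals = Counter(r[position] for r in records)
    return {k: v / len(records) for k, v in totals.items()}


# Invented "real" microdata: (age group, employment status) per reporting unit.
real = (
    [("16-34", "employed")] * 300
    + [("16-34", "unemployed")] * 100
    + [("35-64", "employed")] * 500
    + [("35-64", "unemployed")] * 100
)

synthetic = synthesize(real, n=10_000)
print(shares(real, 1))
print(shares(synthetic, 1))
```

The two printed distributions should be close to each other, which is the property the paragraph above describes: analyses on the synthetic records approximate those on the real ones, while no synthetic record is tied to a specific reporting unit.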

5. Finally, how do you see the evolution of open data in the coming years?  What technological or methodological innovations do you think will transform public statistics?

Alberto González Yanes: I think that, in addition (and we raised this reflection at the National Open Data Meeting when it was held here in Lanzarote), another challenge ahead of us in public statistics is to facilitate the reuse of protected private data by the data owners themselves. This is the concept of portability, which is restricted within public statistics; in fact, the concept does not exist there. While the right of access to confidential data for scientific purposes is included and strengthened by the European regulation, the right of portability is not. It is true that this looks beyond the concept of open data, which is identified with public data released under certain criteria to facilitate its reuse, but what better reuse than what a company can do, for example, with the data about it that we hold in public statistics? That data could be loaded into its own information systems. We must bear in mind that, many times, we have more data from companies than they do themselves, especially in a business structure based on SMEs, such as in the Canary Islands, where companies do not have those gigantic analytical capabilities. Or it could simply be linked with the concept of the data economy, putting that data on the market so that profit can be generated from data that is deposited in our databases. That would possibly require longer-term action, over ten or fifteen years.

Alberto González Yanes: We can't end this podcast without talking about artificial intelligence, which seems to have been the buzzword in recent years, and for good reason. I think there is a technological disruption in this regard. We have the great challenge of incorporating data and statistical information into generative AI systems, especially to avoid the hallucinations or biases that are occurring in many of them. In addition, as generative AI does not hesitate, but asserts, in some cases it presents figures that are not true, and this can lead to reputational problems, because the answer says "INE source" or "ISTAC source" and it is not true. So we have the great challenge of accompanying or improving generative artificial intelligence systems to avoid this bias.

Another great challenge is to improve citizens' literacy in the use of these systems. They are used not only for data access: code and transformations are also generated from the datasets we provide, and sometimes the calculations are done poorly.

María Santana Álvarez: This same reflection is shared internationally, and for this reason working groups have begun to be created to draw up guidelines so that these systems read, interpret and respond appropriately to questions asked about official statistical data. This requires the use of internationally common metadata and the construction of technology that interprets it properly. Summarized like this it sounds like little, but the challenge is significant and the implementation is not trivial. Of course, it will be worth seeing how it develops and the impact it will have on society.

Meanwhile, at the INE we are committed to improving the description of web pages, the metadata of our time series, tables, etc., and creating components so that search engines can find our information in a more efficient and accurate way.

Interview clips

1. What open data services does the INE offer to the public?

2. What is ISTAC’s role in the open statistical data ecosystem? What is its relationship with the INE?

News

The European Data Strategy aims to create a single market where data flows between countries and sectors. In this respect, the public sector holds a large amount of data of value to citizens. Much of this data is made openly available through various open data platforms, but there are also data over which third-party rights apply, limiting their openness. These data can also be of great interest for scientific research purposes.

The existence of numerous administrative registers and public databases, as well as the evolution of the technologies that allow their management, have led to the availability of large amounts of information in all areas that can be used for the benefit of society, increasing the demand for access by researchers.

In this regard, on 3 June, the Data Governance Act was published in the Official Journal of the European Union. This Act seeks to encourage data sharing in the EU, promoting the so-called Data Economy. Among other issues, the new act contemplates the need to develop mechanisms that facilitate the reuse of this type of data, over which third party rights apply, with all the legal guarantees.

One of these mechanisms is the so-called Safe Reading Room, mentioned during the impact assessment prior to the approval of the Act.

What are Safe Rooms?

Safe Rooms are conceived as a single point of contact to support researchers in the re-use of certain protected categories of data held by the public sector. They allow for a controlled processing of the data, while preserving privacy or other rights attached to the data.

In Europe there are various initiatives of this type, such as the CASD (Centre d’Accès Sécurisé aux Données) and the Health Data Hub in France or the Microdata Research Laboratory in Portugal. In Spain we also have several organisations that have already made Safe Reading Rooms available to researchers. Let's look at three examples.

3 examples of Safe Rooms for data sharing in Spain

Bank of Spain Data Laboratory (BELab)

The Banco de España facilitates access to high-quality microdata, guaranteeing its confidentiality through Secure Rooms. Some of the data it offers are microdata on individual companies, on fintech entities, or from the Financial Skills Survey.

Users can access the information both on-site (in Madrid and Barcelona) and remotely, depending on the degree of sensitivity of the information under study. The on-site lab stations, which are isolated without internet access, use Stata, R, Python and Octave for data processing.

To gain access, researchers must submit their CV and an application form explaining the purpose of the research. This application is assessed by a Research Technical Evaluation Committee. If accepted, a series of rules and restrictions are set (timetable, access without a mobile device, etc.).

To guarantee the proper use of the microdata, BELab prepares and supplies the methodological documentation. In addition, technical experts review the work to ensure compliance with the corresponding confidentiality clauses.

Once the work has been completed, the researcher is obliged to mention the source of the data and send a copy of the study carried out. He/she also undertakes not to make any attempt to re-identify the natural or legal persons linked to the data under study.

Social Security Research Rooms

Researchers and academics interested in Social Security databases and microdata have at their disposal three Secure Rooms in Madrid, Barcelona and Albacete, which can only be accessed by authorised personnel, without electronic devices. These rooms are equipped with tools such as SAS, Stata, R, Python and Microsoft Office. Remote access is also allowed through secure devices (called "bastioned devices") that are distributed among researchers.

Some of the data available are the Continuous Sample of Working Lives, the Monthly Affiliation figures or the COVID-19 ERTEs (temporary employment regulation files), among others.

As in the case of the Bank of Spain, the interested party will have to send a request by e-mail to solicitudes.sala-investigacion@seg-social.es. A Committee of Experts will evaluate the request. If approved, the necessary data will be prepared, access to which will be allowed through a private personal folder.

The Committee of Experts will also evaluate the outcome of the research, to ensure regulatory compliance. If everything is correct, the study will be published on the Social Security Data Portal.

National Statistics Institute (INE)

The National Statistics Institute is one of the main publishers of open data in our country, but it also holds sensitive data of value that must be treated with the corresponding confidentiality measures. Access to this information for scientific research purposes follows the protocol foreseen in Regulation (EC) No 223/2009 on European Statistics and in the European Statistics Code of Practice.

This service is intended for researchers working or collaborating in recognised research organisations. The process is similar to the previous cases. An application must be submitted, which will be evaluated by the INE. This request must be as detailed as possible, indicating the variables to be consulted, the geographical-temporal level and the justification of the need for this information. Some of these data may incur costs, as established in the Official State Gazette.

 

These three examples illustrate the importance of Safe Rooms in enabling the reuse of valuable data while guaranteeing the confidentiality and privacy of the information. This allows for more in-depth research, which can generate economic and social good. Intensive use of data helps boost innovation in public sector performance, facilitating the contrast of ideas and promoting creativity and the best use of resources, within the general framework of a modern, participative, open and useful public management that solves or improves social problems and challenges.
