In this new episode of our podcast, we focus on statistical open data, one of the categories of datasets considered to be of high value by the European Union. Today we are going to talk about how this type of data, produced by public administrations, can become a key tool to better understand reality, make decisions and create new services. We have two guests for this.
- María Santana Álvarez, Deputy Director General of Dissemination and Communication at the National Institute of Statistics (INE).
- Alberto González Yanes, Deputy Director of Statistics and Data Analysis at the Canary Islands Institute of Statistics (ISTAC).
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. Why is statistical data considered high-value data? What is its potential?
María Santana Álvarez: In this society in which we live, where data surrounds us and information flows so quickly, it is important that official statistics are known and recognized as high-quality and reliable data, and this is achieved by making them accessible to all of society in an open way. This information is useful for informed decision-making and, therefore, statistical data already has a lot of value, but its reuse increases that value and has a great impact on society.
In relation to the data produced by the INE, the statistical operations for which we are responsible cover topics as varied as demography, the economy, the labour market, the environment, the service sector, science and technology, and living conditions, among many others. I'm going to give you some specific examples of statistical operations: the Turnover Index, the Statistics on R&D Activities, the Monthly Birth Estimate or the Time Use Survey, in addition to the commonly known ones such as the Consumer Price Index, the Labour Force Survey or the Quarterly National Accounts. As you can see, official statistical data is of great value and its reuse is essential.
The definition of high-value datasets has reinforced this. These are data with great potential to generate benefits for society, the environment and the economy. In fact, one of the categories established in the Implementing Regulation is statistics, which includes datasets related to national accounts, demography or inequality (as you can see, the topics I mentioned above), and in this category most of the datasets are produced by the INE.
Alberto González Yanes: In this century, or this beginning of the century in which we are living, so saturated with information and data, it is important to recognise the value of statistics in itself within a democratic society and advanced democratic states. Statistics, as objective and transparent data, need to be available in open formats, not only for the economy, so that new services can be built, but also to reinforce and continue to strengthen data-based decision-making, not only by public administrations, but also by companies and citizens.
One important thing must be taken into account: official data, whether published by the INE or by regional institutes such as ISTAC, generates rights and duties. I always give the example of how official data such as the CPI, or the official population figures themselves, generate rights and duties for municipalities, local entities, councils, governments, etc.
This magnitude, the importance of statistical data as a fundamental pillar of democratic states (something recognised by the United Nations), gives rise to the need for not only the catalogue of datasets defined by the European Commission's Implementing Regulation to be considered of high value, but all the data produced by official statistics, because it is fundamental for democratic states.
2. Can you explain a little more about the role of ISTAC and the INE in the statistical open data ecosystem? What services based on open data do you offer to citizens?
Alberto González Yanes: The regional and state statistical systems are two legs that are coordinated. We have strong coordination within the system, within the CITE (Interterritorial Statistics Committee). What the autonomous communities do is either reuse the INE's own information, or expand information that is not produced at the national level but is necessary for regional purposes. We, for example, are one of the major international benchmarks in the production of tourism statistics, to the point that we even appear among the World Tourism Organization's best practices. We offer information on tourism at the municipal level that some states do not even have at the national level. The information we produce is reused by the tourist information systems of all public administrations, but also by hotel employers' associations. That includes the Tourist Accommodation Statistics, the Survey on Tourism Expenditure, the Statistics on Tourist Movements at Canary Islands Borders (which we develop collaboratively with the National Institute of Statistics, expanding the sample for the Canary Islands) and the Tourist Housing Occupancy Survey. These are the great stars of information in an autonomous community where almost 35% of GDP is linked to tourism.
María Santana Álvarez: In the case of the INE, all our production is offered openly through the website, which is the main meeting point with our users. Proof of this is that last year, in 2025, it received more than 42 million visits. All the data we produce is disseminated according to the publication schedule of statistical operations, free of charge and under an open license.
I like to talk about this topic in a pedagogical way, taking Tim Berners-Lee's five stars as a reference and drawing an analogy with the INE's dissemination system and how we are climbing that ladder. The current INE dissemination system is the result of many years of evolution, and in this evolution we have opted for developing tools that make reuse effective.
Starting with Tim Berners-Lee's stars, one star means that you produce the data and disseminate it openly under a licence that allows reuse, but that is not enough for reusers to make use of it effectively and easily. Two stars would be to offer the aggregated data we produce in proprietary formats such as Excel and PC-Axis. Three stars would be CSV, in flat formats. And we come to the fourth star, which is to make information accessible through URIs. URLs are URIs, and in the case of the INE we have a JSON API for all the aggregated data we produce.
In relation to this, I do want to comment on the advantages of having a JSON API. In our case, access is provided to the metadata and aggregated data that we produce. This allows automatic and direct exploitation of all the information we produce. The data is updated according to the calendar; regardless of when a user accesses the web service, they will find the latest data available. Users of this system can customise their queries and filter through the metadata that defines tables and series.
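As a small illustration of how a JSON API of this kind can be consumed programmatically, here is a minimal sketch in Python. The base URL, function name, series code and JSON field names are assumptions based on how the INE API is commonly documented and should be checked against the official documentation before use.

```python
# Minimal sketch: querying a statistical JSON API (such as the INE's) with plain HTTP.
# The base URL, function name, series code and field names below are illustrative
# assumptions; check the official API documentation for the exact values.
import requests

BASE = "https://servicios.ine.es/wstempus/js/ES"   # assumed base URL
SERIES_ID = "IPC251856"                            # hypothetical series code

# Ask for the last 12 published values of the series ("nult" = n last values).
resp = requests.get(f"{BASE}/DATOS_SERIE/{SERIES_ID}", params={"nult": 12}, timeout=30)
resp.raise_for_status()
serie = resp.json()

# The response carries both metadata (the series name) and the data points.
print(serie.get("Nombre"))
for point in serie.get("Data", []):
    print(point.get("Fecha"), point.get("Valor"))
```

Because the service always returns the latest published values, a reuser can schedule a call like this and automatically pick up each new release without changing anything in their code.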
Nor have we forgotten the large R user community in data science. That's why we've produced a package called INEapir, which incorporates all the functionality of the JSON API and makes it easier for these reusers to work with our data in an environment they already know, with systems and data structures they are used to.
In addition, soon all the documentation related to the API will be available not only in the current format on the website, but also in OpenAPI with Swagger. This will give users who are used to working with general-purpose APIs a more interactive and intuitive way to explore our API.
Alberto González Yanes: It is important to note, first of all, that all statistical data is public by nature, because state statistical regulations (Law 12/1989) and regional regulations require it. In our case, we have different initiatives that allow reuse: an ecosystem of about 10 or 15 APIs supported by international standards such as SDMX (Statistical Data and Metadata Exchange), which lets you retrieve all the information we produce, including the entire open data catalogue, management APIs, all the cartography... We have everything in that API ecosystem, to which we obviously add connectors, whether for Python or R, with different libraries, as well as specific connectors for some market solutions, to facilitate reuse by third parties in dashboards.
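To give a feel for what consuming an SDMX-style REST service looks like for a reuser, here is a minimal Python sketch. The endpoint, dataflow identifier and response layout are placeholders, not ISTAC's actual URLs; the real identifiers must be taken from the ISTAC API documentation.

```python
# Minimal sketch of consuming an SDMX-style REST API like the ones described above.
# The endpoint and dataflow identifier are hypothetical placeholders.
import requests

SDMX_BASE = "https://example-istac-api/statistical-resources/v1.0"  # hypothetical
DATAFLOW = "ISTAC,C00010A_000001,1.0"                               # hypothetical

# SDMX data messages can be requested in a JSON representation.
resp = requests.get(
    f"{SDMX_BASE}/data/{DATAFLOW}/all",
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
message = resp.json()

# An SDMX message travels with its structural metadata (dimensions, code lists),
# which is what makes the observations interpretable and interoperable.
print(list(message.keys()))
```

The value of the standard is precisely that the same client code, pointed at a different dataflow or a different SDMX provider, keeps working, because structure and data travel together in a predictable way.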
For us it is also important, beyond opening the data, to open all the semantic assets. We manage concepts, classifications, record designs... For us, the reuse of classifications and concepts matters as much as the reuse of the statistical data itself. One of the main reusers of this whole system is the Government of the Canary Islands itself, which incorporates, from the base, in the electronic forms of e-government (and this is sometimes little known), all the standardised classifications that we maintain. They do this through the services API that we provide.
Therefore, we have different proposals, not only for access to data, but also for data processing and normalization.
3. How do you work to ensure interoperability between your statistical systems, and also with international organizations, such as Eurostat?
María Santana Álvarez: Earlier I used Tim Berners-Lee's system to describe the level of openness of the INE's dissemination system. I stopped at the fourth star, but there are five stars in that system, and precisely that fifth star guarantees interoperability. From the point of view of dissemination, data that are subject to a national or international classification, such as the National Classification of Economic Activities, of Education or of Occupations, or to other standards approved by the INE, such as the codes of the autonomous communities, provinces and municipalities, will always be accompanied by this metadata. Therefore, data produced by other actors in the national statistical system that use these same classifications, codes, etc. will be interoperable with each other. That is from the point of view of dissemination, but there is also the point of view of production, because in this national statistical system of which the INE is part, we all have to transmit to Eurostat the data we collect and disseminate, as aggregated data. This way of establishing interoperability begins long before dissemination: when new statistical operations are established or grouped together, directives and regulations are developed that set out the methodologies and concepts all Member States have to use. This ensures that when we transmit microdata or aggregated results to Eurostat, it is already known that we have taken those same concepts and standards as a basis.
As for the transmission itself, to make it even more standard, SDMX and DSDs (Data Structure Definitions) are used, based on standard data structures and code lists, to ensure comparability and consistency in official European statistics.
Alberto González Yanes: As María has said, interoperability is a key and fundamental issue within public statistics. She spoke of the SDMX standard, which is fundamental and has even been a reference for the W3C in drawing up interoperability standards and ontologies. She spoke of the creation of codes and classifications that are usable not only among ourselves, but also by the rest of the public sector. And here I link it closely with the competence that public statistics has in terms of semantic standardisation, under Article 10.3 of the National Interoperability Scheme.
In this sense, since we take these matters seriously, the Interterritorial Statistics Committee proposed the creation of a statistical interoperability node at the national level, which would facilitate not only the exchange of information between the different statistical bodies of the Spanish State, but also the transmission of administrative data for statistical purposes from the public administrations to the statistical system. It is a benchmark project at European level. It was funded by the European Commission and we hope that throughout 2026 we will begin to deploy the different actions to develop the node as a reference element in Europe.
4. What are the main current challenges in opening statistical data?
María Santana Álvarez: I mentioned earlier that all our production of aggregate data from statistical operations, and also certain anonymised microdata, are published openly. However, the INE has much more information to offer which, given its nature, cannot be released openly. I am referring to sensitive microdata.
Let me give a little legal background, because this is a very sensitive issue. In 2022 there was an amendment to the Public Statistical Function Law, under which statistical services can grant research entities access to confidential data. These data do not allow the direct identification of the units and can only be used to carry out scientific studies of public interest; in addition, certain requirements must be met to access them. In fact, the statistical services evaluate whether it is possible to provide this information, that is, we are very rigorous in granting access to this data. To give you an idea, the INE handled more than 80 requests for this type of access to confidential microdata last year, and a high percentage of them were considered viable.
In addition, the INE is the coordinator of a project called ES_DataLab, arising from an agreement signed by the Tax Agency, Social Security, the Bank of Spain and the Public Employment Service. All these organisations are large producers of official statistics, but also holders of a large volume of administrative records. ES_DataLab offers researchers access to sensitive microdata sets resulting from the combination of databases from at least two of the agencies that signed this agreement, but this cannot be offered openly for reasons of confidentiality and statistical secrecy.
What challenge is on the horizon to be able to provide this type of data, that is, microdata at the level of the reporting unit, in an open way without posing a problem of confidentiality or statistical secrecy? The solution would be synthetic populations. In fact, at the INE we are working on the construction of these synthetic populations: populations that reproduce the statistical characteristics of the real population, but whose records do not correspond to any real reporting unit. They are fictitious but, when statistical analyses are carried out, they show the same characteristics as the real population. This would be a way to openly publish microdata at this level of detail, without having to go through the evaluation committees we have right now and the restrictions imposed by current legislation.
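To make the general idea of a synthetic population concrete (this is an illustrative sketch only, not the INE's actual methodology), the following Python snippet resamples records from the joint frequencies observed in a microdata file with hypothetical columns, so the synthetic file reproduces the aggregate statistics while keeping no link to any real reporting unit.

```python
# Minimal sketch of the idea behind a synthetic population (illustrative only;
# NOT the INE's methodology). Synthetic records are drawn from the frequencies
# observed in the real microdata, so aggregates are preserved while no record
# keeps any link to a real reporting unit.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 'real' stands in for a confidential microdata file (hypothetical columns).
real = pd.DataFrame({
    "age_group": rng.choice(["16-29", "30-44", "45-64", "65+"], size=5000,
                            p=[0.20, 0.30, 0.35, 0.15]),
    "employed": rng.choice([0, 1], size=5000, p=[0.35, 0.65]),
})

def synthesize(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n synthetic records reproducing the joint frequencies of df."""
    gen = np.random.default_rng(seed)
    # Estimate the joint distribution over the (discrete) variables...
    cells = df.value_counts(normalize=True).reset_index(name="p")
    # ...and resample whole cells with those probabilities.
    idx = gen.choice(len(cells), size=n, p=cells["p"].to_numpy())
    return cells.drop(columns="p").iloc[idx].reset_index(drop=True)

synthetic = synthesize(real, n=5000)
# The cross-tabulation of the synthetic file matches the real one closely.
print(pd.crosstab(synthetic["age_group"], synthetic["employed"], normalize=True))
```

Real synthetic-population methods are far more sophisticated (they must handle many correlated variables and formal disclosure-risk guarantees), but the principle is the same: publish data that behaves statistically like the population without exposing any real unit.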
5. Finally, how do you see the evolution of open data in the coming years? What technological or methodological innovations do you think will transform public statistics?
Alberto González Yanes: I think that, in addition (and we drew this reflection from the National Open Data Meeting when it was held here in Lanzarote), another challenge we have ahead of us in public statistics is how to facilitate the reuse of protected private data by the data owners themselves: the concept of portability, which is restricted within public statistics; there is no such concept. While the right of access to confidential data for scientific purposes is included and strengthened by European regulation, the right of portability is not. It is true that this goes beyond the concept of open data, which is assimilated to public data with certain criteria to facilitate its reuse, but what better reuse than what a company could make, for example, of the data about it held in public statistics? That data could be fed into their information systems. We must bear in mind that we often have more data about companies than they do themselves, especially in a business structure based on SMEs, such as in the Canary Islands, where companies do not have those gigantic analytical capabilities. Or, simply, to link it with the concept of the data economy and put that data on the market, so that value can be generated from data deposited in our databases. That would probably require a longer-sighted action over ten or fifteen years.
Alberto González Yanes: We can't end this podcast without talking about artificial intelligence, which seems to be the buzzword of recent years, and it is so for a reason. I think there is a genuine technological disruption here. We have the great challenge of incorporating data and statistical information into generative AI systems, especially to avoid the hallucinations or bias occurring in many of them. In addition, as generative AI does not hedge but asserts, in some cases it presents data that is not true, and this can lead to reputational problems, because the answer cites "source: INE" or "source: ISTAC" and it is not true. So we have the great challenge of accompanying and improving generative artificial intelligence systems to avoid this bias.
Another great challenge is to train citizens in the literate use of these systems. It is not only a matter of data access: code and transformations are also generated from the datasets we provide, and sometimes those calculations are poorly done as well.
María Santana Álvarez: This same reflection is shared internationally, and for this reason working groups have begun to be created to build guidelines so that these systems read, interpret and respond appropriately to questions based on official statistical data. This requires the use of internationally common metadata and the construction of technology that interprets it properly. Summarised like that it seems little, but the challenge is important and the implementation is not trivial. It will certainly be worth watching how it develops and the impact it will have on society.
Meanwhile, at the INE we are committed to improving the description of web pages, the metadata of our time series, tables, etc., and creating components so that search engines can find our information in a more efficient and accurate way.
Interview clips
1. What open data services does the INE offer to the public?
2. What is ISTAC’s role in the open statistical data ecosystem? What is its relationship with the INE?
In the last fifteen years we have seen public administrations go from publishing their first open datasets to working with much more complex concepts. Interoperability, standards, data spaces and digital sovereignty are some of the concepts of the moment. In parallel, the web has also changed. That open, decentralised and interoperable space that inspired the first open data initiatives has evolved into a much more complex ecosystem, where new technologies and standards coexist with important challenges, ranging from information silos to digital ethics and technological concentration.
To talk about all this, today we are fortunate to have two voices that have not only observed this evolution, but have been direct protagonists of it at an international level:
- Josema Alonso, with more than twenty-five years of experience working on the open web, data and digital rights, has worked at the World Wide Web Foundation, the Open Government Partnership and the World Economic Forum, among others.
- Carlos Iglesias, an expert in web standards, open data and open government, has advised administrations around the world on more than twenty projects. He has been actively involved in communities such as W3C, the Web Foundation and the Open Knowledge Foundation.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. At what point do you think we are now and what has changed with respect to that first stage of open data?
Carlos Iglesias: Well, I think what has changed is that we now understand that the initial battle cry of "we want the data now" is not enough. It was a first phase that at the time was very useful and necessary, because we had to break with the trend of keeping data locked up, of not sharing data. Let's say that the urgency at that time was simply to change the paradigm, and that is why the battle cry was what it was. I have been involved, like Josema, in studying and analysing all those open data portals and initiatives that arose from this movement. And I have seen that many of them began to grow without any kind of strategy. In fact, several fell by the wayside or did not have a clear vision of what they wanted to do. I believe that simple practice led to the conclusion that publishing data alone was not enough. From there, with the maturity of the movement, it has been proposed that more things have to be done, and today we talk more about data governance, about opening data with a specific purpose, about the importance of metadata and models. In other words, it is no longer simply having data for the sake of having it; there is now a vision of data as probably one of the most valuable assets today, and also as a necessary infrastructure for many things to work, just as infrastructures such as road and public transport networks or energy were key in their day. Right now we are at the moment of the great explosion of artificial intelligence. A series of issues have converged to make this explode, and the change is immense, despite the fact that perhaps only a little more than ten or fifteen years have passed since that first "we want the data now" movement. I think that right now the panorama is completely different.
Josema Alonso: Yes, it is true that we had that idea of "you publish it and someone will come and do something with it". And what that did is that people began to become aware. But I, personally, could not have imagined that a few years later we would even have a European-level directive on the publication of open data. It was something, to be honest, that we received with great pleasure. And then it began to be implemented in all member states. That moved consciences a little and changed practices, especially within the administration. There was a lot of fear of "let's see if I put something out there that is problematic, that is of poor quality, that I will be criticised for", etc. But it began to generate a culture of data and an awareness of its usefulness, which is very important. And, as Carlos commented, in recent years I think no one doubts this anymore. The investments being made, for example, at European level and in Member States, including in our country, Spain, in the promotion and development of data spaces amount to hundreds of millions of euros. Nobody has that kind of doubt anymore, and now the focus is more on how to do it well, on how to get everyone to interoperate. That is, when a European data space is created for a specific sector, such as agriculture or health, all countries and organisations should be able to share data in the best possible way, exchanging it through common models and within trusted environments.
2. In this context, why have standards become so essential?
Josema Alonso: I think it's because of everything we've learned over the years. We have learned that people need a certain freedom when it comes to developing their own systems. The architecture of the web itself, for example, works like that: it has no central control or anything like it, and each participant on the web manages things in their own way. But there are clear rules about how those things then have to interact with each other; otherwise it wouldn't work, otherwise we wouldn't be able to load a web page in different browsers or on different mobile phones. So, what we are seeing lately is a growing effort to figure out how to reach that type of consensus for mutual benefit. For example, part of my current work for the European Commission is in the Semantic Interoperability Community, where we manage the creation of uniform models that are used across Europe, definitions of basic standard vocabularies that are used in all systems. In recent years this consensus has also been supported, let's say, through regulations issued at European level. We have seen the regulation on data, the regulation on data governance and the regulation on artificial intelligence, which also try to impose a certain order and guardrails. It's not that everyone cuts across the middle of the mountain, because then in the end we would get nowhere; we are all going to try to do it by consensus, to drive along the same road to reach the same destination together. And I think that public administrations, apart from regulating, should be very transparent about how this is done. That is how we can all see that what is built is built in a certain way: data models that are transparent, that everyone can see and participate in developing. And this is where we are seeing some shortcomings in algorithmic and artificial intelligence systems, where we do not know very well what data they use or where it is hosted. This is where perhaps we should have a little more influence in the future. But I think that as long as this duality is achieved, of generating consensus and providing a context in which people feel safe developing things, we will continue to move in the right direction.
Carlos Iglesias: If we look at the principles that made the web work in its day, there is also a lot of focus on the community part and on leaving an open platform, developed in the open, with open standards that everyone could join. Everyone's participation was sought to enrich that ecosystem. And I think that with data we should consider that this is the way to go. In fact, it's also a bit like the concept that I think is behind data spaces. In the end, it is not easy to do something like that. It's very ambitious, and we don't see an invention like the web every day.
3. From your perspective, what are the real risks of data getting trapped in opaque infrastructures or models? More importantly, what can we do to prevent it?
Carlos Iglesias: Years ago there were attempts to quantify the amount of data generated daily. I think that now no one even tries, because it is on a completely different scale, and on that scale there is only one way to work, which is by automating things. And when we talk about automation, in the end what you need are standards, interoperability, trust mechanisms, etc. If we look at which companies had the highest market value worldwide ten or fifteen years ago, they were companies such as Ford or General Electric. If you look at the top ten worldwide today, there are companies that we all know and use every day, such as Meta, which is the parent company of Facebook, Instagram, WhatsApp and others, or Alphabet, which is the parent company of Google. In fact, I hesitate a little right now, but probably all of the ten largest listed companies in the world are dedicated to data. We are talking about a gigantic ecosystem and, for this to really work and remain an open ecosystem from which everyone can benefit, the key is standardisation.
Josema Alonso: I agree with everything Carlos said, and we have to focus on not getting trapped. Above all, public administrations have an essential role to play. I mentioned regulation before, which sometimes people don't like very much because the regulatory map is starting to become extremely complicated. The European Commission, through an omnibus package, is trying to alleviate this regulatory complexity; as an example, the data regulation itself obliges companies that hold data to facilitate data portability for their users, which seems to me essential. We're going to see a lot of changes in that area. There are three things that always come to mind. One is that permanent training is needed. This changes every day at an astonishing speed, and the volumes of data now being managed are huge. As Carlos said before, a few days ago I was talking to a person who manages the infrastructure of one of the largest streaming platforms globally, and he told me that in just one week they receive requests for AI-generated data as large in volume as the entire catalogue they have available. So the administration needs permanent training on these issues of all kinds, both on the technological frontier we have just mentioned and on what we talked about before: how to improve interoperability, how to create better data models, etc. Another is common infrastructure in Europe, such as the future European digital wallet, which would be the equivalent of the national citizen folder. A very simple example we are dealing with is the birth certificate. It is very complicated to try to integrate the systems of twenty-seven different countries, which in turn have regional governments and, in turn, local governments. So, the more we invest in common infrastructure, both at the semantic level and at the level of the infrastructure itself, the cloud, etc., the better we will do. And the last one is the need for distributed but coordinated governance. Each of us is governed by certain laws at local, national or European level. It is good that we begin to have more and more coordination in the higher layers, that those higher layers permeate down to the lower layers, and that systems become increasingly easy to integrate and to understand each other. Data spaces are one of the major investments at European level where I believe this is beginning to be achieved. So, to summarise, three very practical things to do: permanent training, investment in common infrastructure, and governance that remains distributed but is increasingly coordinated.
In this podcast we talk about transport and mobility data, a topic that is very present in our day-to-day lives. Every time we consult an application to find out how long a bus will take, we are taking advantage of open data linked to transport. In the same way, when an administration carries out urban planning or optimises traffic flows, it makes use of mobility data.
To delve into the challenges and opportunities behind the opening of this type of data by Spanish public administrations, we have two exceptional guests:
- Tania Gullón Muñoz-Repiso, director of the Division of Transport Studies and Technology of the Ministry of Transport and Sustainable Mobility.
- Alicia González Jiménez, deputy director in the General Subdirectorate of Cartography and Observation of the Territory of the National Geographic Institute.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. Both the IGN and the Ministry generate a large amount of data related to transport. Of all of them, can you tell us which data and services are made available to the public as open data?
Alicia González: On the part of the National Geographic Institute, I would say everything: everything we produce is available to users, because since the end of 2015 the dissemination policy adopted by the General Directorate of the National Geographic Institute, through the autonomous body National Centre for Geographic Information (CNIG), which is where all products and services are distributed, has been an open data policy. Everything is distributed under the CC BY 4.0 licence, which guarantees free and open use; you simply have to make an attribution, a mention of the origin of the data. So we are talking, in general, not only about transport but about all kinds of data: more than 100 products representing more than two and a half million files that users are increasingly demanding. In fact, in 2024 we had up to 20 million files downloaded, so it is in high demand. Specifically for transport networks, the fundamental dataset is the Geographic Reference Information of Transport Networks (IGR-RT). It is a multimodal geospatial dataset composed of five transport networks that are continuous throughout the national territory and interconnected. Specifically, it comprises:
1. The road network, which includes every road regardless of its owner and covers the whole territory. There are more than 300,000 kilometres of roads, which are also connected to all the street maps, to the urban road network of all population centres. In other words, we have a road graph that forms the backbone of the entire territory, in addition to connecting the roads that are later distributed and disseminated in the National Topographic Map.
2. The second most important network is the rail transport network. It includes all rail transport data, as well as metro, tram and other rail-based modes.
3 and 4. In the maritime and air domains, the networks are limited to infrastructures: they contain all the ports on the Spanish coast and, on the air side, all aerodrome, airport and heliport infrastructures.
5. And finally, the last network, which is much more modest and almost residual: cable transport.
Everything is interconnected through intermodal relationships. It is a set of data that is generated from official sources. We cannot incorporate just any data, it must always be official data and it is generated within the framework of cooperation of the National Cartographic System.
As a dataset that complies with the INSPIRE Directive, both in its definition and in the way it is disseminated through standard web services, it has also been classified as a high-value dataset in the mobility category, in accordance with the Implementing Regulation on high-value datasets. It is a fairly important and standardised dataset.
How can it be located and accessed? Precisely because it is standard, it is catalogued in the IDE (Spatial Data Infrastructure) catalogue, thanks to the standard description of its metadata. It can also be located through the official INSPIRE data and services catalogue, and it is accessible through portals as relevant as the open data portal.
Once we have located it, how can users access it? How can they see the data? There are several ways. The easiest: consult our viewer. All the data is displayed there, and there are query tools to facilitate its use. Then, of course, there is the CNIG download centre, where we publish all the data from all the networks, and it is in great demand. And the last way is to consult the standard web services that we generate: visualisation and download services using different technologies. In other words, it is a dataset that is available to users for reuse.
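As an illustration of what consuming one of those standard visualisation web services looks like for a reuser, here is a minimal Python sketch using OWSLib against an OGC WMS. The service URL and layer identifier are assumptions for the sake of the example; the real ones must be taken from the IGN/CNIG catalogue of web services.

```python
# Minimal sketch of consuming a standard OGC visualisation (WMS) service with
# OWSLib. The service URL and layer name are placeholder assumptions; take the
# real ones from the IGN/CNIG catalogue of standard web services.
from owslib.wms import WebMapService

WMS_URL = "https://servicios.idee.es/wms-inspire/transportes"  # assumed endpoint

wms = WebMapService(WMS_URL, version="1.3.0")

# List the layers the service advertises in its capabilities document.
for name, layer in wms.contents.items():
    print(name, "-", layer.title)

# Request a map image for a bounding box (hypothetical layer identifier).
img = wms.getmap(
    layers=["TN.RoadTransportNetwork.RoadLink"],
    srs="EPSG:4326",
    bbox=(-18.5, 27.5, -13.0, 29.5),   # roughly the Canary Islands
    size=(800, 400),
    format="image/png",
)
with open("roads.png", "wb") as f:
    f.write(img.read())
```

Because the service follows the OGC standard, the same client works against any compliant WMS: the reuser only changes the URL and layer names advertised in the capabilities document.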
Tania Gullón: In the Ministry we also share a lot of open data. I would like, in order not to take too long, to comment in particular on four large sets of data:
1. The first would be the OTLE, the Observatory of Transport and Logistics in Spain, an initiative of the Ministry of Transport whose main objective is to provide a global and comprehensive view of the situation of transport and logistics in Spain. It is organised into seven blocks: mobility, socio-economy, infrastructure, safety, sustainability, metropolitan transport and logistics. These are not georeferenced data, but statistical data. The Observatory makes data, graphs, maps and indicators available to the public and, beyond that, also offers annual reports, monographs, conferences, etc. The same applies to the cross-border observatories we run collaboratively with Portugal and France.
2. The second dataset I want to mention is the NAP, the National Multimodal Transport Access Point, an official digital platform managed by the Ministry of Transport but developed collaboratively between the different administrations. Its objective is to centralise and publish all the digitised information on the passenger transport offer in the national territory, for all modes of transport. What do we have here? All the schedules, services, routes and stops of all transport services: road transport, urban, intercity, rural, discretionary and on-demand buses. There are 116 datasets. There is the rail transport one, with the schedules of all those trains, their stops, etc., and also maritime and air transport. This data is constantly updated. To date we only publish static data, in GTFS (General Transit Feed Specification) format, a standard that can be reused and is useful for the development of mobility applications by reusers (see the sketch after this list). While the NAP initially focused on static data, such as routes, schedules and stops, progress is being made toward incorporating dynamic data as well. In fact, from December we have an obligation under European regulations to provide this data in real time in order, ultimately, to improve transport planning and the user experience.
3. The third dataset is Hermes, the geographic information system of the general-interest transport network. What is its objective? To offer a comprehensive view, in this case georeferenced. Here I want to refer to what my colleague Alicia mentioned, so that you can see how we are all collaborating with each other. We are not inventing anything: everything is projected onto those road axes, the IGR-RT, the geographic reference information on transport networks. What we do is add all the technical parameters, as added value, to obtain a complete, comprehensive, multimodal information system for roads, railways, ports, airports, rail terminals and also waterways. It is a GIS (Geographic Information System), which allows all this analysis, not only downloading and querying through the open web services we put at the service of citizens, but also through an open data catalogue built with CKAN, which I will comment on later. In the end there are more than 300 parameters that can be consulted. What are we talking about? For each road section, the average traffic intensity, the average speed and the capacity of the infrastructures; planned actions are also known (not only the network in service, but also the planned network, the actions the Ministry plans to carry out), as well as the ownership of the road, lengths, speeds, accidents... many parameters: modes of access, co-financed projects, alternative fuels, the trans-European transport network, etc. That's the third of the datasets.
4. The fourth dataset is perhaps the largest, at 16 GB per day. This is the project we call Big Data Mobility, a pioneering initiative that uses Big Data and artificial intelligence technologies to analyse the country's mobility patterns in depth. It is mainly based on the analysis of anonymised mobile phone records of the population to obtain detailed information on people's movements, not individualised but aggregated at the census district level. Since 2020, a daily mobility study has been carried out and all this data is published openly: mobility by hour and by origin/destination, which allows us to monitor and evaluate transport demand in order to plan improvements to infrastructures and services. In addition, as the data is released openly, it can be used for any purpose: tourism, research...
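To show why the GTFS format mentioned for the NAP matters to reusers, here is a minimal Python sketch that reads the stops and routes tables of a static GTFS feed. The file names inside the ZIP (stops.txt, routes.txt, etc.) are defined by the GTFS specification; the feed file name itself is a hypothetical placeholder.

```python
# Minimal sketch: reading a static GTFS feed of the kind published through the NAP.
# A GTFS feed is a ZIP file containing plain CSV tables defined by the spec
# (stops.txt, routes.txt, trips.txt, stop_times.txt, ...). The feed file name
# below is hypothetical.
import zipfile
import pandas as pd

FEED = "feed_example.zip"  # a GTFS feed downloaded from the access point

with zipfile.ZipFile(FEED) as z:
    with z.open("stops.txt") as f:
        stops = pd.read_csv(f)    # stop_id, stop_name, stop_lat, stop_lon, ...
    with z.open("routes.txt") as f:
        routes = pd.read_csv(f)   # route_id, route_short_name, route_type, ...

print(f"{len(stops)} stops and {len(routes)} routes in the feed")
# route_type is a coded field in the GTFS spec (e.g. 2 = rail, 3 = bus).
print(routes["route_type"].value_counts())
```

Because every publisher uses the same table names and column meanings, a journey planner or mobility app can ingest feeds from many operators with the same code, which is exactly the kind of reuse the NAP is designed to enable.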
2. How is this data generated and collected? What challenges do you have to face in this process and how do you solve them?
Alicia González: Specifically, for products generated in geographic information system environments and geospatial databases, in the end these are projects whose fundamental basis is the capture of data and the integration of existing reference sources. When we see that the data holder has a piece of information, that is the one that must be integrated. In summary, the main technical tasks could be identified as follows:
- On the one hand, capture, that is, when we want to store a geographical object we have to digitize it, draw it. Where? On an appropriate metric basis such as the aerial orthophotographs of the National Plan of Aerial Orthophotography (PNOA), which is also another dataset that is available and open. Well, when we have, for example, to draw or digitize a road, we trace it on that aerial image that PNOA provides us.
- Once we have captured that geometric component, we have to provide it with attributes, and not just any data will do: they have to come from official sources. So we have to locate the owner of that infrastructure, or the provider of the official data, to determine the attributes, the characterisation we want to give to that information, which initially was only geometric. To do this, we carry out a series of source validation processes, checking that there are no incidents, and processes that we call integration, which are quite complex, to guarantee that the result meets our requirements.
- And finally, a fundamental phase in all these projects is the assurance of geometric and semantic quality. In other words, a series of quality controls must be developed and executed to validate the product, the final result of that integration and confirm that it meets the requirements indicated in the product specification.
In terms of challenges, a fundamental one is data governance: the result we generate is fed from certain sources, but in the end a new product is created, so the role of each provider, who may later also be a user, has to be defined. Another challenge in this whole process is locating data providers. Sometimes the body responsible for the infrastructure or the object we want to store in the database does not publish the information in a standardised way, or it is difficult to locate because it is not in a catalogue. Sometimes it is hard to find the official source you need to complete the geographic information. Looking at the user, I would highlight another challenge: having the agility to identify, in a flexible and fast way, use cases that change as users' demands change, because in the end it is about continuing to be relevant to society. Finally, and because the Geographic Institute is a scientific and technical environment and this affects us a lot, another challenge is digital transformation: we are working on technological projects, so we also need a great capacity to manage change and adapt to new technologies.
Tania Gullón: Regarding how data is generated and collected and the challenges we face: the NAP, the National Access Point for Multimodal Transport, for example, is generated collaboratively, that is, the data comes from the autonomous communities themselves, from the consortia and from the transport companies. The challenge is that many autonomous communities are not yet digitised, and there are many companies... The digitalisation of the sector is moving slowly; it is moving, but slowly. As a result there is incomplete data and duplicate data, and governance is not yet well defined. It happens, for instance, that the company ALSA uploads all its buses, but it operates buses in all the autonomous communities, and if an autonomous community uploads its data at the same time, that data is duplicated. It's as simple as that. It is true that we are just starting and that the governance needed to avoid this excess of data is not yet well defined. Before, data was missing; now there is almost too much.
In Hermes, the geographic information system, what is done, as I said, is to project everything onto the transport network information, the official one that Alicia mentioned, and to integrate data from the different infrastructure managers and administrators, such as Adif, Puertos del Estado, AENA, the Directorate-General for Roads, ENAIRE, etc. What is the main challenge, if I had to single one out, because we could talk about this for an hour? It has cost us a lot; we have been working on this project for seven years and it has been hard because, at first, people did not believe in it. They didn't think it was going to work and they didn't collaborate. In the end, all this means knocking on the door of Adif, of AENA, and changing the mindset so that data is not kept in a drawer but put at the service of the common good. And I think that is what has cost us the most. In addition, there is the issue of governance, which Alicia has already mentioned. You go to ask for a piece of data and, within the organisation itself, they do not know who owns it, because perhaps traffic data is handled by several different departments. And who owns it? All this is very important.
We have to say that Hermes has been the great promoter of data offices, of the Adif data office. In the end they realised that what they needed was to put their own house in order, and the same goes for everyone's house, including the Ministry: data offices are needed.
In the Big Data project, how is the data generated? In this case it is completely different. It is a pioneering project, based more on new technologies, in which data is generated from anonymised mobile phone records. By reconstructing that large amount of data, the records from each antenna in Spain, with artificial intelligence and a series of algorithms, these matrices are rebuilt. Then the data from that sample (in the end we have a sample of 30% of the population, more than 13 million mobile lines) is extrapolated with open data from the INE. And what else do we do? It is calibrated with external sources, that is, with trusted reference sources such as AENA ticketing, flights, Renfe data, etc. We calibrate this model to be able to generate these matrices with quality. The challenges: it is very experimental. To give you an idea, we are the only country that has all this data, so we have been blazing a trail and learning along the way. The difficulty is, again, the data: the data needed for calibration is hard to obtain, and hard to receive with a certain periodicity, because this runs in real time and we permanently need that flow of data. There is also adaptation to the user, as Alicia has said: we must adapt to what society and the reusers of this Big Data are demanding. And we must also keep pace with technology, as Alicia said; the telephony data that exists now is not the same as two years ago. And there is the great challenge of quality control. But here I think I will let Alicia, who is the real expert, explain what mechanisms exist to ensure that the data is reliable, up to date and comparable, and then I will give you my view, if you like.
Alicia González: How can reliability, timeliness and comparability be guaranteed? I don't know if reliability can be guaranteed, but I think there are a couple of indicators that are especially relevant. One is the degree to which a dataset conforms to the regulations that apply to it. In the field of geographic information, the way of working is always standardised: there is the ISO 19100 family of standards on geographic information/geomatics, and the INSPIRE Directive itself, which greatly conditions the way of working and publishing data. And also, looking at the public administration, I think that the official seal should itself be a guarantee of reliability. In other words, when we process data we must do so in a homogeneous and unbiased way, whereas a private company may perhaps be conditioned by other factors. I believe these two parameters are important and can indicate reliability.
In terms of the degree of updating and comparability of the data, I believe the user deduces this information from the metadata. Metadata is, in the end, the cover letter of a dataset. So, if a dataset is correctly and truthfully documented with metadata, and if this is done according to standard profiles (in the geographic field we are talking about the INSPIRE or GeoDCAT-AP profiles), it is much easier to see whether different datasets are comparable, and the user can determine and decide whether a dataset finally satisfies their needs in terms of updating and comparability with another dataset.
Tania Gullón: Totally agree, Alicia. And if you allow me to add: in Big Data, for example, we have always been very committed to measuring quality, all the more so with new technologies whose results, at first, people did not trust. Always trying to measure this quality, which in this case is very difficult because these are large datasets, from the beginning we designed processes that take time. The daily quality control process for the data takes seven hours, but it is true that at the beginning we had to detect whether an antenna had gone down, whether something had happened... Then we run controls with statistical parameters and other internal consistency checks, and what we detect are anomalies. What we are seeing is that 90% of the anomalies that come out are real mobility anomalies. In other words, there are no errors in the data; they are anomalies: there has been a demonstration or a football match, issues that distort mobility, or a storm, heavy rain or anything like that. And it is important not only to control that quality and spot anomalies; we also believe it is very important to publish the quality criteria: how we measure quality and, above all, the results. Not only do we publish the data every day, but we also publish this quality metadata that Alicia mentions: what the sample was like that day, the values obtained for anomalies. This is also released openly: not only the data, but the metadata. And then we also publish the anomalies and the reason for them. When an anomaly is found we say: "OK, there has been an anomaly because in the town of Casar (to give one example, this covers all of Spain) it was the Torta del Casar festival." And that's it, the anomaly has been found and it is published.
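Purely to illustrate the kind of internal-consistency check described here (this is a generic sketch, not the Ministry's actual quality pipeline), the following Python snippet flags anomalous days in a daily trip-count series using a robust z-score based on the rolling median and MAD; the data and threshold are hypothetical.

```python
# Minimal sketch of a daily internal-consistency check (illustrative only, not
# the Ministry's actual pipeline): flag days whose trip counts deviate strongly
# from recent behaviour, using a robust z-score based on the median and MAD.
import numpy as np
import pandas as pd

def flag_anomalies(daily_trips: pd.Series, window: int = 28, threshold: float = 3.5) -> pd.Series:
    """Return a boolean Series marking days whose values look anomalous."""
    median = daily_trips.rolling(window, min_periods=7).median()
    mad = (daily_trips - median).abs().rolling(window, min_periods=7).median()
    robust_z = 0.6745 * (daily_trips - median) / mad.replace(0, np.nan)
    return robust_z.abs() > threshold

# Toy example with hypothetical data: a festival day doubles observed trips.
dates = pd.date_range("2024-03-01", periods=60, freq="D")
rng = np.random.default_rng(1)
trips = pd.Series(1_000_000 + rng.normal(0, 20_000, size=60).round(), index=dates)
trips[pd.Timestamp("2024-04-15")] = 2_000_000   # simulated anomaly (e.g. a local festival)

print(trips[flag_anomalies(trips)])
```

As in the interview, a flag like this does not mean the data is wrong: most flagged days correspond to real events (festivals, matches, storms), which is why publishing the anomaly and its explanation alongside the data is so valuable.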
And how do we measure another quality parameter, thematic accuracy? In this case, by comparing with reference sources taken as ground truth. We know that the evolution of the data with respect to itself is already well controlled through that internal logical consistency, but we also have to compare it with what happens in the real world. I talked about this before with Alicia; we said: "the data is reliable, but what is the reality of mobility? Who knows it?" In the end we have some clues, such as ticketing data on how many people have boarded the buses. If we have that data, we have a clue; but for the people who walk, or who take their cars and so on, what is the reality? It is very difficult to have a point of comparison, but we do compare with all the data from AENA, Renfe and bus concessions, and all these controls are run to determine how far we deviate from the reality that we can know.
3. All this data serves as a basis for developing applications and solutions, but it is also essential when it comes to making decisions and accelerating the implementation of the central axes, for example, the Safe, Sustainable and Connected Mobility Strategy or the Sustainable Mobility Bill. How is this data used to make these real decisions?
Tania Gullón: If you will allow me, I would first like to introduce this strategy and the Law for those who do not know them. One of the axes, axis 5 of the Ministry's Safe, Sustainable and Connected Mobility Strategy 2030, is "Smart Mobility". It is precisely focused on this, and its main objective is to promote digitalisation, innovation and the use of advanced technologies to improve efficiency, sustainability and the user experience in Spain's transport system. And precisely one of the measures of this axis is the "facilitation of Mobility as a Service, Open Data and New Technologies". In other words, this is where all the projects we are discussing are framed. In fact, one submeasure is to promote the publication of open mobility data, another is to carry out analyses of mobility flows, and the last is the creation of an integrated mobility data space. I would like to emphasise (and here I turn to the Bill that we hope to see approved soon) that the Law, in Article 89, regulates the National Access Point, so we can see how it is included in this legislative instrument. The Law also establishes a key digital instrument for the National Sustainable Mobility System: note the importance given to data, in that a mobility law states that this integrated mobility data space is a key digital instrument. This data space is a trusted data-sharing ecosystem, materialised as a digital infrastructure managed by the Ministry of Transport in coordination with SEDIA (the Secretary of State for Digitalisation and Artificial Intelligence), whose objective is to centralise and structure the mobility information generated by public administrations, transport operators, infrastructure managers, etc., and to guarantee open access to all this data for all administrations under the conditions set by the regulations.
Alicia González: In this case, I want to say that any objective decision-making, of course, has to be based on data that, as we said before, must be reliable, up to date and comparable. In this sense, it should be noted that the fundamental support the IGN offers to the Ministry for the deployment of the Safe, Sustainable and Connected Mobility Strategy is the provision of data services and complex geospatial information analyses, many of them, of course, based on the transport networks dataset we have been talking about.
As an example, I would mention the accessibility maps with which we contribute to axis 1 of the strategy, "Mobility for all". Through the Rural Mobility Table, the IGN was asked whether we could generate maps representing the cost, in time and distance, for any citizen living in any population centre to access the nearest transport infrastructure, starting with the road network. In other words, how much it costs a user, in terms of effort, time and distance, to reach the nearest motorway or dual carriageway from their home and then, by extension, any road in the basic network. We did that analysis (this is why I said that this network is the backbone of the whole territory; it is continuous) and we finally published the results on the web. They are also open data: any user can consult them and, in addition, we offer them not only numerically but also represented in different types of maps. In the end, this geolocated visibility of the result provides fundamental value and facilitates, of course, strategic decision-making in infrastructure planning.
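To illustrate the general idea behind such an accessibility analysis (an illustrative sketch only, not the IGN's actual method), the following Python snippet uses a toy road graph with travel times on its edges and a multi-source Dijkstra search to compute, for every settlement, the time to the nearest motorway access point; all node names and weights are invented.

```python
# Minimal sketch of the idea behind an accessibility analysis (illustrative
# only, not the IGN's method): given a road graph with travel times on its
# edges, compute for every settlement the time to the nearest motorway access
# point using a multi-source Dijkstra search.
import networkx as nx

# Toy road graph: nodes are places, edge weights are travel times in minutes.
G = nx.Graph()
G.add_weighted_edges_from([
    ("village_a", "town_b", 12),
    ("town_b", "junction_1", 8),      # junction_1 is a motorway access point
    ("village_c", "town_b", 25),
    ("village_d", "junction_2", 5),   # junction_2 is another access point
    ("village_a", "village_d", 40),
], weight="minutes")

motorway_accesses = {"junction_1", "junction_2"}

# Shortest travel time from the set of access points to every node, in one pass.
access_time = nx.multi_source_dijkstra_path_length(
    G, sources=motorway_accesses, weight="minutes"
)

for place in ["village_a", "village_c", "village_d"]:
    print(f"{place}: {access_time[place]} min to the nearest motorway access")
```

On the real, continuous road graph described in the interview, the same principle yields an access-time value for every population centre, which is what the published accessibility maps represent.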
Another example to highlight that is possible thanks to the availability of open data is the calculation of monitoring indicators of the Sustainable Development Goals of the 2030 Agenda. Currently, in collaboration with the National Institute of Statistics, we are working on the calculation of several of them, including one directly associated with Transport, which seeks to monitor goal 11, which is to make cities more inclusive, safe, resilient and sustainable.
4. Speaking of this data-based decision-making, there is also cooperation at the level of data generation and reuse between different public administrations. Can you tell us about any examples of a project?
Tania Gullón: Let me also link this back to data-based decision-making, which I touched on earlier with the issue of the Law. It can also be said that all this Big Data, Hermes and everything we have discussed is favouring the shift of the Ministry and other organisations towards data-driven organisations, which means that decisions are based on the analysis of objective data. When you ask for an example like that, I have so many that I wouldn't know which to choose. Big Data has been used for infrastructure planning for a few years now. Before, it was done with surveys, and sizing questions came up: how many lanes do I put on a road? Or something very basic: what frequency do we need on a train line? Well, if you don't have data on what demand is going to be, you can't plan it. This is now done with Big Data, not only by the Ministry but, as it is open, by all administrations, all city councils and all infrastructure managers. Knowing the mobility needs of the population allows us to adapt our infrastructures and services to those real needs. For example, commuter services in Galicia are now being studied, or think of the undergrounding of the A-5. The data is also used for emergencies, which we have not mentioned, but it is key there too. We always see that when there is an emergency, suddenly everyone thinks "data, where is the data, where is the open data?", because it has been fundamental. I can tell you, in the case of the DANA floods, which is perhaps the most recent example, several commuter train lines were seriously affected, the tracks were destroyed, and 99% of the vehicles of the people living in Paiporta, in Torrent, in the entire affected area, were out of service; the other 1% only because they were not in the DANA area at the time. So mobility had to be restored as soon as possible and, thanks to this open data, within a week there were buses providing alternative transport services that had been planned with the Big Data data. In other words, look at the impact on the population.
Speaking of emergencies, this project was born precisely because of an emergency: COVID. The study, this Big Data, was born in 2020 because the Presidency of the Government was in charge of monitoring mobility on a daily basis and publishing it openly. And here I link to that collaboration between administrations, organisations, companies and universities. These mobility data fed the epidemiological models. We worked with the Carlos III Institute and the Barcelona Supercomputing Center, with the institutes and research centres that were beginning to size hospital beds for the second wave. When we were still in the first wave - we didn't even know what a wave was - they were already telling us "be careful, because there is going to be a second wave, and with this mobility data we will be able to estimate how many beds are going to be needed, according to the epidemiological model". That shows how important reuse is. We know that this Big Data is being used by thousands of companies, administrations, research centres and researchers around the world. We also receive inquiries from Germany and many other countries, because in Spain we are somewhat of a pioneer in publishing all this data openly. We are setting an example, and not only for transport but also, for example, for tourism.
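As a purely illustrative sketch of how reusers typically work with this kind of open origin-destination mobility data, the following Python snippet aggregates daily trips between two zones to get a first estimate of demand. The file name and column names are hypothetical and do not reproduce the actual published schema of the Ministry's mobility study.

```python
# Hedged sketch: estimating demand between two zones from open origin-destination data.
# "od_trips_daily.csv" and its columns (date, origin, destination, trips) are hypothetical.
import pandas as pd

trips = pd.read_csv("od_trips_daily.csv", parse_dates=["date"])

# Daily demand between two illustrative zones, e.g. to size an alternative bus service.
demand = (
    trips[(trips["origin"] == "ZONE_A") & (trips["destination"] == "ZONE_B")]
    .groupby("date")["trips"]
    .sum()
)

print(demand.describe())  # typical and peak daily trips inform frequency and fleet size
```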
Alicia González: In the field of geographic information, at the level of cooperation, we have a specific instrument, the National Cartographic System, which directly promotes coordination of the actions of the different administrations in terms of geographic information. We do not know how to work in any other way than by cooperating. A clear example is the same dataset we have been talking about: the geographic reference information on transport networks is the result of this cooperation. At the national level it is promoted and coordinated by the Geographic Institute, but regional cartographic agencies also participate in its updating and production, with different degrees of collaboration; in certain areas this even reaches the co-production of data for certain subsets. In addition, one of the characteristics of this product is that it is generated from official data from other sources, so there is collaboration there in any case: there is cooperation because there is an integration of data, because in the end it has to be filled in with official data, starting perhaps with data provided by the INE, the Cadastre, the cartographic agencies themselves, the local street maps... But once the result has been formed, as I mentioned before, it has an added value that is of interest to the original suppliers themselves. For example, this dataset is reused internally, at home, in the IGN: any product or service that requires transport information is fed by this dataset. There is internal reuse, but also reuse across public administrations at all levels. In the state sector, for example, once the result has been generated, the Cadastre is interested in it for studies analysing the delimitation of the public domain associated with infrastructures. Or the Ministry itself, as Tania commented before: Hermes was generated from the processing of RT (transport network) data. The Directorate-General for Roads uses transport networks in its internal management to produce its traffic map, manage its catalogue, etc. And in the autonomous communities, the result is also useful to the cartographic agencies, and even at the local level. So there is a continuous, cyclical reuse, as it should be; in the end it is all public money and it has to be reused as much as possible. And in the private sphere it is also reused, and value-added services are generated from this data in multiple use cases. Not to go on too long, simply that: we provide data on which value-added services are generated.
5. And finally, could you briefly recap some ideas that highlight the impact of this data on daily life and its commercial potential for reusers?
Alicia González: Very briefly, I think that the fundamental impact on everyday life is that the distribution of open data has democratized access to data for everyone: for companies, but also for citizens. Above all, I think it has been fundamental in the academic field, where it is surely now easier to develop certain investigations that in other times were more complex. Another impact on daily life is the institutional transparency that this implies. As for the commercial potential for reusers, I reiterate the previous idea: the availability of data drives innovation and the growth of value-added solutions. In this sense, looking at the report carried out in 2024 by ASEDIE, the Association of Infomedia Companies, on the impact that the geospatial data published by the CNIG had on the private sector, there were a couple of quite important conclusions. One of them was that every time a new dataset is released, reusers are incentivized to generate value-added solutions and, in addition, it allows them to focus their efforts on developing innovation rather than on data capture. The report also made clear that, since the adoption of the open data policy I mentioned at the beginning, adopted in 2015 by the IGN, 75% of the companies surveyed responded that they had been able to significantly expand their catalogue of products and services based on this open data. So I believe the impact is ultimately enriching for society as a whole.
Tania Gullón: I subscribe to all of Alicia's words, I totally agree. I would add that small transport operators and municipalities with fewer resources have at their disposal all this open, free, quality data and access to digital tools that allow them to compete on equal terms. For companies or municipalities, imagine being able to plan their transport and be more efficient: not only does it save them money, but in the end they also improve the service to the citizen. And of course, the fact that public-sector decisions are made based on data, and that this ecosystem of data sharing is encouraged, favouring the development of mobility applications, for example, has a direct impact on people's daily lives. The same goes for transport aid: by studying the impact of transport subsidies with accessibility data, you identify who the most vulnerable are and, in the end, policies become increasingly fair, which obviously benefits the citizen. Decisions about how to invest everyone's money, our taxes, in infrastructure, aid or services should be based on objective, real data and not on intuition. That is the most important thing.
Interview clips
1. What data does the Ministry of Transport and Sustainable Mobility make publicly available?
2. What data does the National Geographic Institute (IGN) make publicly available?
In this episode we talk about the environment, focusing on the role that data plays in the ecological transition. Can open data help drive sustainability and protect the planet? We found out with our two guests:
- Francisco José Martínez García, conservation director of the natural parks of the south of Alicante.
- José Norberto Mazón, professor of computer languages and systems at the University of Alicante.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. You are both passionate about the use of data for society. How did you discover the potential of open data for environmental management?
Francisco José Martínez: For my part, I can tell you that when I arrived at the public administration, at the Generalitat Valenciana, the Generalitat launched a viewer called Visor Gva, which is open and provides a lot of information: images, metadata, data in various fields... And the truth is that it made it much easier for me - and continues to do so - to resolve administrative files and carry out the work of a civil servant. Later, another database was incorporated, the Biodiversity Data Bank, which offers data in grids of one kilometre by one kilometre. And finally, applied to the natural spaces and wetlands that I manage, there are water quality data. All of them are open and can be used by any researcher to generate applied research.
Jose Norberto Mazón: In my case, it was precisely with Francisco as director. He directs three natural parks, wetlands in the south of Alicante, and about one of them in which we had a special interest, the Natural Park of Laguna de la Mata and Torrevieja, Francisco told us about his experience - all this experience he has just described. At the University of Alicante we have been working for some time on data management, open data, data interoperability, etc., and we saw the opportunity to approach data management, data generation and data reuse from the perspective of the territory, of the Natural Park itself. Together with other entities such as Proyecto Mastral, Faunatura, AGAMED, and also colleagues from the Polytechnic University of Valencia, we saw the possibility of studying these useful data, focusing above all on the concept of high-value data, which the European Union was betting on: data with the potential to generate socio-economic or environmental benefits, to benefit all users and to contribute to a European society based on the data economy. And so we set out to see how we could collaborate, especially to discover the potential of data at the level of the territory.
2. Through a strategy called the Green Deal, the European Union aims to become the world's first competitive, resource-efficient economy to achieve net-zero greenhouse gas emissions by 2050. What concrete measures are most urgent to achieve this, and how can data help reach these goals?
Francisco José Martínez: The European Union has several lines of action, several programmes such as LIFE, focused on endangered species, or the ERDF funds to restore habitats... Here in Laguna de la Mata and Torrevieja, we have improved terrestrial habitats with these ERDF funds, precisely so that these habitats capture CO2 better and host more native plant communities, eliminating invasive species. Then, at the regulatory level, we have the Nature Restoration Regulation, in force since 2024, which requires us to restore up to 30% of degraded terrestrial and marine ecosystems. I must also say that the Biodiversity Foundation, under the Ministry, runs quite a few projects related, for example, to the creation of climate shelters in urban areas. In other words, there is a series of projects and a lot of funding for everything related to renaturalization, habitat improvement and species conservation.
Jose Norberto Mazón: To complement what Francisco has said, I would also focus on data management and the importance given to it in the European Green Deal, specifically through data sharing projects that make data more interoperable. In the end, all the actors that generate data can be useful through its combination, generating much more value in what are called data spaces, and especially in the data space of the European Green Deal. Some initial projects have recently been completed. To highlight a couple of them: the USAGE project (Urban Data Spaces for Green dEal), with two very interesting pilots. One on how data to mitigate climate change has to be introduced into urban management, in the city of Ferrara, in Italy; and another on data governance and how to comply with the FAIR principles, in this case in Zaragoza, with a concept of climate islands that is also very interesting. Then there is the AD4GD project (All Data for Green Deal), which has also carried out pilots on data interoperability: in the Berlin lake network - Berlin has about 300 lakes whose water quality and quantity have to be monitored, and this has been done through sensorization; in the management of biological corridors in Catalonia, with data on how species move and how these corridors need to be managed; and in some air quality initiatives with citizen science. These projects have already finished, but there is a very interesting project at the European level that is going to launch the large data space of the European Green Deal, the SAGE (Sustainable Green Europe Data Space) project, which is developing ten use cases covering this entire area. One that is very pertinent, because it is aligned with the wetland natural parks of the south of Alicante that Francisco directs, is the one on the balance between nature and ecosystem services: how nature must be protected and conserved while also allowing socio-economic activities in a sustainable way. This data space will integrate remote sensing, models based on artificial intelligence, data, etc.
3. Would you like to add any other projects at this local or regional level?
Francisco José Martínez: Yes, of course. The one we have done with Norberto, his team and several departments of the Polytechnic University of Valencia and the University of Alicante: the digital twin. Research has been carried out to generate a digital twin of the Natural Park of Las Lagunas, here in Torrevieja. And the truth is that it has been applied research: a lot of data has been generated from sensors, as well as from direct observations and from image and sound recorders. A good record of information has been compiled on noise, climate and meteorological data, which enables good management and is an invaluable help for those of us who have to make decisions day by day. This project has also collected data of a social nature: tourist use, people's feelings (whether or not they agree with what they see in the natural space). In other words, we have improved our knowledge of this natural space thanks to this digital twin, and that is information that neither our viewer nor the Biodiversity Data Bank can provide.
Jose Norberto Mazón: Francisco was talking, for example, about knowledge of the influx of people into certain areas of the natural park, and also about knowing what visitors feel and think, because obtaining that other than through surveys, which are very cumbersome, is complicated. We have put this digital twin, with its multitude of sensors and with data that are also interoperable, at the service of discovering that knowledge, and it allows us to know the territory very well. Obviously, the fact that it is territorial does not mean that it is not scalable. What we are doing with the digital twin project, the ChanTwin project, can be transferred or extrapolated to any other natural area, because the problems we have encountered will appear in any natural area: connectivity problems, interoperability problems with data coming from sensors, etc. We have sensors of many types - influx of people, water quality, temperature and climatic variables, pollution, etc. - and all with full guarantees of data privacy. I have to say this, because it is very important: we always make sure that this data collection guarantees people's privacy. We can know the concerns of the people who visit the park and also, for example, where those people come from. This is very interesting information for park management, because it allows Francisco, for example, to make more informed decisions to manage the park better. Moreover, the people who visit the park come from specific municipalities, with city councils that may have a Department of the Environment or a Department of Tourism, and this information can be very useful to them to highlight certain environmental, biodiversity or socio-economic aspects.
Francisco José Martínez: Data are fundamental in the management of the natural environment of a wetland, a mountain, a forest, a pasture... in general, of all natural spaces. Simply by following and monitoring certain environmental parameters we can explain events that may happen, for example a fish die-off. Without a history of temperature and dissolved oxygen data, it is very difficult to know whether it is due to that or to a pollutant. Water temperature is related to dissolved oxygen: the higher the temperature, the less dissolved oxygen. And without oxygen, in spring and summer - when the ambient temperatures, whatever they are, transfer to the water, to the lagoons, to the wetlands - a disease appears, botulism, and for two years now more than a thousand animals have died each year. The way to control it is by anticipating when temperatures will reach the specific level beyond which oxygen almost disappears from the water; that gives us time to plan the work teams that remove the carcasses, which is the fundamental action to prevent it. Another example is the monthly census of waterfowl, which are observed in person and recorded; we also have recorders that capture sounds. With that we can know the dynamics of species arriving in migration, and with that we can also manage the water. Another example is the temperature of the lagoon here in La Mata, which we monitor with the digital twin, because we know that when it reaches almost thirty degrees the main food of the birds, brine shrimp, disappears, as they cannot live at those extreme temperatures with that salinity. But we can bring in sea water which, despite how hot these last springs and summers have been, is always cooler, so we can refresh the lagoon and extend the life of this species, which is precisely synchronized with the reproduction of the birds. So we can manage the water thanks to the monitoring and thanks to the data we have on water temperatures.
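As an illustration of the kind of early-warning rule Francisco describes, the sketch below flags the days on which the lagoon's water temperature approaches the level at which dissolved oxygen collapses and brine shrimp die off. The file, column names and exact threshold are assumptions for the example; the real digital twin works with its own sensor feeds and management criteria.

```python
# Hedged sketch: flag days when lagoon water temperature approaches a critical level.
# "lagoon_sensor.csv" and its columns (timestamp, water_temp_c) are hypothetical.
import pandas as pd

TEMP_ALERT_C = 28.0  # assumed warning threshold, a little below the critical ~30 °C mentioned above

readings = pd.read_csv("lagoon_sensor.csv", parse_dates=["timestamp"])
daily_max = readings.set_index("timestamp")["water_temp_c"].resample("D").max()

alerts = daily_max[daily_max >= TEMP_ALERT_C]
for day, temp in alerts.items():
    # Time to plan work teams or schedule a seawater intake before oxygen collapses.
    print(f"{day.date()}: max water temperature {temp:.1f} °C")
```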
Jose Norberto Mazón: Look at the importance of these examples that Francisco mentioned, which are paradigmatic, and at the importance of the use of data. I would simply add that, in the end, the effort is to make all these data open and compliant with the FAIR principles, that is, interoperable, because as we have heard Francisco comment, they are data from many sources, each with different characteristics, collected in different ways, etc. He was talking about sensor data, but also about other data that is collected in other ways. These data should also allow us to start co-creation processes for tools that use them at various levels: of course, at the level of management of the natural park itself, to make informed decisions, but also at the level of citizens and even of other types of professionals. As Francisco said, economic activities are carried out in these parks and wetlands, so being able to co-create tools with these actors, or with university research staff themselves, is very interesting. And here it is always a matter of encouraging third parties, both natural and legal persons - companies, startups, entrepreneurs, etc. - to build applications and value-added services with that data: to design easy-to-use tools for decision-making, for example, or any other type of tool. This would be very interesting, because it would also give us an entrepreneurial ecosystem around that data. And it would also make society itself more involved, through this open data and its reuse, in environmental care and environmental awareness.
4. An important aspect of this transition is that it must be "fair and leave no one behind". What role can data play in ensuring that equity?
Francisco José Martínez: In our case, we have been carrying out citizen science actions with the Environmental Education and Dissemination technicians. We collect data with the people who sign up for these activities, two activities a month. For example, we have carried out censuses of bats of different species - because one sees bats without distinguishing the species, sometimes without even seeing them - on night routes, to detect and record them. We have also done photo-trapping activities to detect mammals that are very difficult to see. With this we get children, families and people in general to learn about fauna they do not even know exists when they are walking in the mountains. And I believe that we reach a lot of people and that we are spreading this knowledge to as many people and as many sectors as possible.
Jose Norberto Mazón: And from that data - look at all the data Francisco is talking about - and building on the line that Francisco follows as director of the natural parks of the south of Alicante, what we asked ourselves is: can we go one step further using technology? We have made video games that make it possible to raise awareness among target groups that may otherwise be very difficult to reach. For example, teenagers, in whom we must somehow instil that behaviour and that appreciation of the importance of natural parks. We think video games can be a very interesting channel. And how have we done it? By basing these video games on data: on the data Francisco has mentioned and also on the data from the digital twin itself, data on the water surface, noise levels... We include all this data in the video games. They are dynamic video games that help build a better awareness of what the natural park is and of its environmental values and the conservation of biodiversity.
5. You've been talking to us for a while about all the data you use, which in the end comes from various sources. Can we summarize the type of data you use in your day-to-day life and what are the challenges you encounter when integrating it into specific projects?
Francisco José Martínez: The data are spatial: images with their metadata, censuses of birds, mammals and the different taxonomic groups, fauna, flora... We also carry out inventories of protected flora in danger of extinction. There are fundamental meteorological data which, by the way, are also very important for civil protection; look at all the disasters caused by cold drops or cut-off lows. There are very important data such as water quality, physical and chemical data, and the height of the water surface, which helps us to know evaporation and evaporation curves and thus manage water inputs. And of course there are social data on public use, because public use is very important in natural spaces: it is a way of opening up to citizens so that they can get to know their natural resources, value them and thus protect them. As for the difficulties, it is true that there is a series of data, especially from research, which we cannot access. They are in repositories restricted to technicians within the administration, or even held by consultants, and are difficult to access. I think Norberto can explain better how this could be integrated into platforms, by sectors, by groups...
Jose Norberto Mazón: In fact, it is a core issue for us. In the end there is a lot of open data, as Francisco has explained throughout this conversation, but it is true that it is very dispersed, because it is also generated to meet various objectives. The main objective of open data is that it is reused, that is, used for purposes other than those for which it was initially collected. But what we find is that many proposals are, as we would say, top-down, while the problem really lies in the territory, from the bottom up, with all the actors involved in the territory, where a lot of data is also generated. It is true that there is data, for example satellite data from remote sensing, which is generated by the satellites themselves and then reused by us; but the data that comes from sensors or from citizen science is generated in the territory itself. And we often find that, when researchers do work in a specific natural park, the research team publishes its articles and data openly (because the Science Law requires them to publish in open repositories), but that is very research-oriented. Other types of actors - the park management, the managers of a local entity, or even citizens themselves - may not be aware that this data is available and may not even have mechanisms to consult it and obtain value from it. The greatest difficulty, in fact, is precisely this: ensuring that the data generated in the territory is reused from the territory, because it is very well suited to solving the territory's own problems. That is the difficulty we are trying to tackle with the projects we have under way, at the moment with the creation of a data lake, a data architecture that allows us to manage all that heterogeneity of data and to do it from the territory. But of course, we really have to do it in a federated way, with the philosophy of open data at a federated level, and with something extra, because the range of situations within the territory is very wide. There is a multitude of actors and, although we are talking about open data, there may also be actors who say: "I want to share certain data, but not other data yet, because I may lose some competitiveness, although I would not mind sharing it in three months' time." In other words, it is also necessary to have control over certain types of data, and for open data to coexist with other types of data that can be shared, perhaps not so broadly, but still providing great value. We are exploring this possibility with a new project we are creating: a data space for environmental and biodiversity data in these three natural parks in the south of the province of Alicante. We are working on that project: Heleade.
If you want to know more about these projects, we invite you to visit their websites.
Interview clips
1. How was the digital twin of the Lagunas de Torrevieja Natural Park conceived?
2. What projects are being promoted within the framework of the European Green Deal Data Space?
Do you know why it is so important to categorize datasets? Do you know the references that exist to do it according to the global, European and national standards? In this podcast we tell you the keys to categorizing datasets and guide you to do it in your organization.
- David Portolés, Project Manager of the Advisory Service
- Manuel Ángel Jáñez, Senior Data Expert
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. What do we mean when we talk about cataloguing data and why is it so important to do so?
David Portolés: When we talk about cataloguing data, what we want is to describe it in a structured way. In other words, we talk about metadata: information related to data. Why is it so important? Because thanks to this metadata, interoperability is achieved. This word may sound complicated, but it simply means that systems can communicate with each other autonomously.
Manuel Ángel Jáñez: Exactly, as David says, categorizing is not just labeling. It is about providing data with properties that make it understandable, accessible and reusable. For that we need agreements or standards. If each producer defines their own rules, consumers will not be able to interpret them correctly, and value is lost. Categorizing means reaching a consensus between the general and the specific, and this is not new: it is an evolution of library documentation, adapted to the digital environment.
2. So we understand that interoperability is speaking the same language to get the most out of it. What references are there at global, European and national level?
Manuel Ángel Jáñez: The way to describe data is in an open way, using standards or reference specifications, that is, frameworks. There are references at three levels (a minimal metadata sketch follows the list):
- Globally: DCAT (a W3C recommendation) allows you to model catalogs, datasets, distributions, services, etc. In essence, all the key entities that are then reused in the rest of the profiles.
- At the European level: DCAT-AP, the application profile for data portals in the European Union, particularly those of the public sector. It is essentially the basis for the Spanish profile, DCAT-AP-ES.
- In Spain: DCAT-AP-ES is the profile that incorporates more specific restrictions for the Spanish context. It is based on the 2013 Technical Interoperability Standard (NTI). This profile adds new features, evolves the model to make it compatible with the European standard, adds features related to high-value datasets (HVDs) and adapts the standard to the present state of the data ecosystem.
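As a minimal illustration of what these profiles describe, here is a hedged sketch that builds a tiny DCAT description of a dataset and its distribution with the Python rdflib library. The URIs and values are invented for the example, and a real DCAT-AP-ES record includes many more mandatory properties defined by the profile.

```python
# Hedged sketch: a minimal DCAT-style dataset description with rdflib.
# URIs and literals are illustrative; real DCAT-AP-ES records carry many more properties.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dataset = URIRef("https://example.org/catalog/dataset/air-quality")          # hypothetical dataset URI
distribution = URIRef("https://example.org/catalog/dataset/air-quality/csv")  # hypothetical distribution URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Calidad del aire", lang="es")))
g.add((dataset, DCTERMS.publisher, URIRef("https://example.org/organization")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.accessURL, URIRef("https://example.org/data/air-quality.csv")))
g.add((distribution, DCTERMS.format, Literal("CSV")))

print(g.serialize(format="turtle"))  # metadata that a harvester or portal could consume
```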
David Portolés: With a good description, the reuser can search, retrieve and locate the datasets of interest and, in addition, discover new datasets they had not contemplated. Standards are shared models and vocabularies, and the main difference between them is the degree of detail they apply. The key is to strike a balance: they must be as general as possible so as not to be restrictive, but at the same time specific enough to be useful. Although we talk a lot about open data, these standards also apply to protected data, which can likewise be described. The universe of application of these standards is very broad.
3. Focusing on DCAT-AP-ES, what help or resources are there for a user to implement it?
David Portolés: DCAT-AP-ES is a set of rules and basic application models. Like any technical standard, it has an application guide and, in addition, there is an online implementation guide with examples, conventions, frequently asked questions and spaces for technical and informative discussion. This guide has a very clear purpose: to create a community around the technical standard, generating a knowledge base accessible to all and a transparent, open support channel for anyone who wants to participate.
Manuel Ángel Jáñez: The available resources do not start from zero. Everything is aligned with European initiatives such as SEMIC, which promotes semantic interoperability in the EU. We want a living, dynamic tool that evolves with needs, under a participatory approach, with good practices, debates, harmonisation of the profile, etc. In short, the aim is for the model to be useful, robust, easy to maintain over time and flexible enough for anyone to participate in its improvement.
4. Is there any existing thematic implementation in DCAT-AP-ES?
Manuel Ángel Jáñez: Yes, important steps have been taken in that direction. For example, the model for high-value datasets has already been included, which is key for data relevant to the economy or society and useful, for example, for AI. DCAT-AP-ES is inspired by profiles such as DCAT-AP v2.1.1 (2022), which incorporates some semantic improvements, but there are still thematic implementations to be incorporated into DCAT-AP-ES, such as data series. The idea is that thematic extensions will enable modelling for specific types of datasets.
David Portolés: As Manu says, the idea is that it is a living model. Possible future extensions are:
- Geographical data: GeoDCAT-AP (European).
- Statistical data: StatDCAT-AP.
In addition, future directives on high-value data will have to be taken into account.
5. And what are the next objectives for the development of DCAT-AP-ES?
David Portolés: The main objective is to achieve full adoption by:
- Vendors: changing the way they offer and disseminate the metadata of their datasets under this new paradigm.
- Reusers: integrating the new profile into their developments, their systems and all the integrations they have built so far, so that they can create much better derivative products.
Manuel Ángel Jáñez: Also to maintain coherence with international standards such as DCAT-AP. We want to remain committed to an agile, participatory technical governance model, aligned with emerging areas (such as protected data, sovereign data infrastructures and data spaces). In short: that DCAT-AP-ES is useful, flexible and prepared for the future.
Interview clips
1. Why is it important to catalog data?
2. How can we describe data in open formats?
Collaborative culture and citizen open data projects are key to democratic access to information. This contributes to free knowledge that allows innovation to be promoted and citizens to be empowered.
In this new episode of the datos.gob.es podcast, we are joined by two professionals linked to citizen projects that have revolutionized the way we access, create and reuse knowledge. We welcome:
- Florencia Claes, professor and coordinator of Free Culture at the Rey Juan Carlos University, and former president of Wikimedia Spain.
- Miguel Sevilla-Callejo, researcher at the CSIC (Spanish National Research Council) and Vice-President of the OpenStreetMap Spain association.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. How would you define free culture?
Florencia Claes: It is any cultural, scientific, intellectual expression, etc. that as authors we allow any other person to use, take advantage of, reuse, intervene in and relaunch into society, so that another person does the same with that material.
In free culture, licenses come into play, those permissions of use that tell us what we can do with those materials or with those expressions of free culture.
2. What role do collaborative projects have within free culture?
Miguel Sevilla-Callejo: Having projects that are capable of bringing together these free culture initiatives is very important. Collaborative projects are horizontal initiatives in which anyone can contribute. A consensus is structured around them to make that project, that culture, grow.
3. You are both linked to collaborative projects such as Wikimedia and OpenStreetMap. How do these projects impact society?
Florencia Claes: Clearly the world would not be the same without Wikipedia. We cannot conceive of a world without Wikipedia, without free access to information. I think Wikipedia is associated with the society we are in today. It has built what we are today, also as a society. The fact that it is a collaborative, open, free space means that anyone can join and intervene in it, and that it maintains a high level of rigour.
So, how does it impact? It impacts in that (it will sound a little cheesy, but...) we can be better people, we can know more, we can have more information. It has an impact in that anyone with access to the internet can benefit from its content and learn, without necessarily having to go through a paywall or register on a platform and hand over personal data in order to access the information.
Miguel Sevilla-Callejo: We call OpenStreetMap the Wikipedia of maps, because a large part of its philosophy is copied, or cloned, from the philosophy of Wikipedia. If you think of Wikipedia, what people do is contribute encyclopedic articles. What we do in OpenStreetMap is enter spatial data. We build a map collaboratively, and this means that the openstreetmap.org page, which is where you could go to look at the maps, is just the tip of the iceberg. This is where OpenStreetMap is a little more diffuse and hidden: most of the web pages, maps and spatial information you see on the Internet most likely comes from the data of the great free, open and collaborative database that is OpenStreetMap.
Many times you are reading a newspaper and you see a map, and that spatial data is taken from OpenStreetMap. It is even used by institutions: in the European Union, for example, OpenStreetMap is being used. It is used in information from private companies, public administrations, individuals, etc. And, because it is free, it is constantly reused.
I always like to bring up projects that we have done here, in the city of Zaragoza. We have generated the entire urban pedestrian network, that is, all the pavements, the zebra crossings, the areas where you can walk... and with this you can calculate how to move around the city on foot. You cannot find this information on sidewalks, crosswalks and so on on a website, because it is not very lucrative compared to getting around by car. And you can take advantage of it, as we did in some work that I directed at the university, to understand how different mobility is for blind people, wheelchair users or people with a baby carriage.
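As a small, hedged example of this kind of reuse, the sketch below uses the osmnx library to download Zaragoza's walking network from OpenStreetMap and compute a shortest on-foot route between two points. The coordinates are illustrative, and the projects Miguel describes go much further (accessibility profiles for wheelchair users, blind people, etc.).

```python
# Hedged sketch: reuse OpenStreetMap's pedestrian network for a simple walking route.
# The two coordinate pairs are illustrative points somewhere in Zaragoza.
import networkx as nx
import osmnx as ox

# Download the walking network (sidewalks, crossings, paths...) from OpenStreetMap.
G = ox.graph_from_place("Zaragoza, Spain", network_type="walk")

# Snap two illustrative locations (lon/lat) to their nearest graph nodes.
orig = ox.distance.nearest_nodes(G, X=-0.8890, Y=41.6520)
dest = ox.distance.nearest_nodes(G, X=-0.8770, Y=41.6560)

# Shortest path by length (metres) along the pedestrian network.
route = nx.shortest_path(G, orig, dest, weight="length")
length_m = nx.shortest_path_length(G, orig, dest, weight="length")
print(f"Route with {len(route)} nodes, about {length_m:.0f} m on foot")
```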
4. You are telling us that these projects are open. If a citizen is listening to us right now and wants to participate in them, what should they do to participate? How can you be part of these communities?
Florencia Claes: The interesting thing about these communities is that you don't need to be formally associated or linked to them to be able to contribute. In Wikipedia you simply enter the Wikipedia page, create a user account or not, and you can edit. What is the difference between creating a username or not? With one you will have better access to the contributions you have made, but we do not need to be associated or registered anywhere to be able to edit Wikipedia.
There are groups at the local or regional level, related to the Wikimedia Foundation, that receive grants to hold meetings or activities. That's good, because you meet people with the same concerns, who are usually very enthusiastic about free knowledge. As my friends say, we are a bunch of geeks who have met and feel that we have a group of belonging in which we share and plan how to change the world.
Miguel Sevilla-Callejo: In OpenStreetMap it is practically the same; that is, you can do it alone. It is true that there is a bit of a difference with respect to Wikipedia. If you go to the openstreetmap.org page, or to the documentation wiki - wiki.OpenStreetMap.org - you have all the documentation there.
It is true that to edit in OpenStreetMap you do need a user account, so that the changes people make to the map can be tracked better. If it were anonymous there could be more of a problem, because it is not like the texts in Wikipedia. But as Florencia said, it's much better if you join a community.
We have local groups in different places. One of the initiatives that we have recently reactivated is the OpenStreetMap Spain association, in which, as Florencia said, we are a group of people who like data and free tools, and there we share all our knowledge. A lot of people come up to us and say "hey, I just joined OpenStreetMap, I like this project, how can I do this? How can I do that?" And well, it is always much better to do it with other colleagues than alone. But anyone can do it.
5. What challenges have you encountered when implementing these collaborative projects and ensuring their sustainability over time? What are the main challenges, both technical and social, that you face?
Miguel Sevilla-Callejo: One of the problems we find in all these movements that are so horizontal and in which we have to seek consensus to know where to move forward, is that in the end it is relatively problematic to deal with a very diverse community. There is always friction, different points of view... I think this is the most problematic thing. What happens is that, deep down, as we are all moved by enthusiasm for the project, we end up reaching agreements that make the project grow, as can be seen in Wikimedia and OpenStreetMap themselves, which continue to grow and grow.
From a technical point of view, for some particular things you need a certain computer proficiency, but the basics are very, very simple. For example, we have run mapathons, in which we meet in a room with computers and start entering spatial information for areas where, for example, there has been a natural disaster or something similar. Basically, on a satellite image, people mark the little houses they see - little houses in the middle of the Sahel, for example - to help NGOs such as Doctors Without Borders. That is very easy: you open the browser, open OpenStreetMap and right away, with a few pointers, you are able to edit and contribute.
It is true that, if you want to do things that are a little more complex, you need more computer skills, so we always adapt. There are people entering data in a very professional way, including buildings or importing data from the cadastre, and there are people like a girl here in Zaragoza who recently discovered the project and is entering the data she finds with an application on her mobile phone.
I do find a certain gender bias in the project. Within OpenStreetMap that worries me a little, because it is true that the large majority of the people editing, including the community, are men, and in the end that does mean that some data has a certain bias. But hey, we are working on it.
Florencia Claes: In that sense, the same happens to us in the Wikimedia environment. Worldwide, roughly 20% of the people participating in the project are women and 80% are men, and that means that, in the case of Wikipedia, for example, there is sometimes a predominance of articles about footballers. It is not a deliberate preference; simply, the people who edit have those interests, and as they are mostly men, we have more footballers and we miss articles related, for example, to women's health.
So we do face biases, and we face the challenge of coordinating the community. Sometimes people with many years of experience participate alongside new people, and achieving a balance is very important and very difficult. But the interesting thing is when we manage to remember that the project is above us, that we are building something, that we are giving something away, that we are participating in something very big. When we become aware of that again, the differences calm down and we refocus on the common good which, after all, I believe is the goal of both projects, the Wikimedia environment and OpenStreetMap.
6. As you mentioned, both Wikimedia and OpenStreetMap are projects built by volunteers. How do you ensure data quality and accuracy?
Miguel Sevilla-Callejo: The interesting thing about all this is that the community is very large and there are many eyes watching. When there is a lack of rigour in the information, both in Wikipedia - which people know better - and in OpenStreetMap, alarm bells go off. We have tracking systems and it is relatively easy to spot dysfunctions in the data, so we can act quickly. This gives OpenStreetMap in particular the capacity to react and update the data practically immediately, and to solve any problems that arise quite quickly. It is true that there has to be a person paying attention to that place or that area.
I have always liked to describe OpenStreetMap data as a kind of beta map, in the sense used in software: it has the very latest data, but there may be some minor errors. As a constantly updated, high-quality map it can be used for many things, but for others of course not, because we have other reference cartography that is produced by the public administration.
Florencia Claes: In the Wikimedia environment we also work like this, through the sheer number of eyes watching what we and others do. Within this community, each person takes on a role. There are formally established roles, such as administrators or librarians, but there are others that are informal: I like to patrol, so what I do is keep an eye on new articles; I might review the articles published each day to see whether they need any support or improvement or whether, on the contrary, they are so poor that they need to be removed from the main namespace or deleted.
The key to these projects is the number of people who participate, and everything is voluntary and altruistic. The passion is very high, the level of commitment is very high, so people take great care of these things. Whether data is curated and uploaded to Wikidata or an article is written on Wikipedia, each person who does it does so with great affection and great care. Then, as time goes by, they keep an eye on the material they uploaded, to see how it continued to grow, whether it was used, whether it became richer or whether, on the contrary, something was deleted.
Miguel Sevilla-Callejo: Regarding data quality, I find interesting, for example, a recent initiative by the Territorial Information System of Navarre. They have migrated all their data for planning and guiding emergency routes to OpenStreetMap, taking its data as the basis. They got involved in the project and improved the information, but building on what was already there [in OpenStreetMap], considering that it was of high quality and much more useful to them than other alternatives. That shows the quality and importance this project can have.
7. This data can also be used to generate open educational resources, along with other sources of knowledge. What do these resources consist of and what role do they play in the democratization of knowledge?
Florencia Claes: OER, open educational resources, should be the norm. Each teacher who generates content should make it available to citizens, and that content should be built in modules from free resources. It would be ideal.
What role does the Wikimedia environment have in this? It ranges from hosting information that can be used when building resources, to providing spaces to carry out exercises or to take data and, for example, work with SPARQL. In other words, there are different ways of approaching the Wikimedia projects in relation to open educational resources. You can intervene and teach students how to identify data, how to verify sources, how to make a critical reading of how information is presented and curated, and, for example, how to compare it across languages.
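As a small, hedged illustration of the kind of exercise Florencia mentions, the sketch below sends a SPARQL query to the Wikidata endpoint using the SPARQLWrapper library and prints the Spanish (or English) labels of a few results. The query and the item and property identifiers are illustrative and should be checked before using them in class.

```python
# Hedged sketch: a classroom-style SPARQL query against Wikidata via SPARQLWrapper.
# Identifiers are illustrative (P31 = instance of, P17 = country, Q29 = Spain,
# Q46169 intended as "national park"); verify them before real use.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="open-data-podcast-example/0.1",  # Wikidata asks clients to identify themselves
)
endpoint.setQuery("""
SELECT ?park ?parkLabel WHERE {
  ?park wdt:P31 wd:Q46169 .    # instance of national park
  ?park wdt:P17 wd:Q29 .       # country: Spain
  SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en". }
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["parkLabel"]["value"])
```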
Miguel Sevilla-Callejo: OpenStreetMap is very similar. What is interesting and unique is the nature of the data. It is not information in different formats, as in Wikimedia; here the information is that free spatial database that is OpenStreetMap. So the limit is the imagination.
I remember that there was a colleague who went to some conferences and made a cake with the OpenStreetMap map. He would feed it to the people and say, "See? These are maps that we have been able to eat because they are free." To make more serious or more informal or playful cartography, the limit is only your imagination. It happens exactly the same as with Wikipedia.
8. Finally, how can citizens and organizations be motivated to participate in the creation and maintenance of collaborative projects linked to free culture and open data?
Florencia Claes: I think we clearly have to do what Miguel said about the cake: you have to make a cake and invite people to eat it. Speaking seriously about what we can do to motivate citizens to reuse this data, I believe - especially from personal experience and from the groups with which I have worked on these platforms - that a friendly interface is a very important step.
In Wikipedia, the visual editor was activated in 2015. The visual editor led to many more women joining to edit Wikipedia. Before, editing was done only in code, which at first glance can seem hostile or distant, or "that's not for me". So, having interfaces where people do not need much prior knowledge to understand that this is a package containing certain data, that they can read it with a certain program or load it into a certain tool - making it simple, friendly, attractive - will remove many barriers and set aside the idea that data is only for computer scientists. I think that data goes further, that we can really take advantage of it in very different ways. So I think it is one of the barriers we should overcome.
Miguel Sevilla-Callejo: In our case, until about 2015 (forgive me if the date is not exact), we had an interface that was quite horrible, almost like the code editing you have in Wikipedia, or worse, because you had to enter the data knowing the tagging scheme, etc. It was very complex. Now we have an editor where, basically, you are in OpenStreetMap, you hit edit, and a very simple interface appears. You no longer even have to enter tags in English; it is all translated. Many things are pre-configured and people can enter data immediately and in a very simple way. What that has allowed is that many more people come to the project.
Another very interesting thing, which also happens in Wikipedia, although Wikipedia is much more focused on the web interface, is that an ecosystem of applications and services has grown up around OpenStreetMap. This has made it possible, for example, for mobile applications to appear that, in a very fast and simple way, allow data to be entered directly in the field. And that makes it possible for people to contribute data easily.
I wanted to stress it again, even though I know we keep coming back to the same point, because I think it is important: within the projects we forget that we need people to be aware that the data is free, that it belongs to the community, that it is not in the hands of a private company, that it can be modified and transformed, that behind it there is a community of volunteers, and that none of this detracts from the quality of the data, which reaches everywhere. We need people to come closer and not see us as weirdos. Wikipedia is much more integrated into society's knowledge - and now, with artificial intelligence, even more so - but in OpenStreetMap it still happens that people look at you as if to say "what are you telling me? I use another application on my mobile", when in fact they are using OpenStreetMap data without knowing it. So we need to get closer to society, so that people know us better.
Returning to the issue of the association, that is one of our objectives: that people know us, that they know this data is open, that it can be transformed, that they can use it, and that they are free to build with it, as I said before, whatever they want; the limit is their imagination.
Florencia Claes: I think we should somehow use gamification, games in the classroom, to incorporate maps and data into day-to-day schooling. I think we would score a point in our favour there. And given that we are within a free ecosystem, we can integrate visualization and reuse tools directly into the pages of the data repositories, which I think would make everything much friendlier and would give citizens a certain power; it would empower them in such a way that they would be encouraged to use the data.
Miguel Sevilla-Callejo: It is also worth noting that there are things that connect both projects (we, the OpenStreetMap and Wikipedia people, sometimes forget this ourselves): there is data that we can exchange, coordinate and combine. And that would also add to what you have just said.
Interview clips
1. What is OpenStreetMap?
2. How does Wikimedia help in the creation of Open Educational Resources?
Open knowledge is knowledge that can be reused, shared and improved by other users and researchers without noticeable restrictions. This includes data, academic publications, software and other available resources. To explore this topic in more depth, we have representatives from two institutions whose aim is to promote scientific production and make it available in open access for reuse:
- Mireia Alcalá Ponce de León, Information Resources Technician of the Learning, Research and Open Science Area of the Consortium of University Services of Catalonia (CSUC).
- Juan Corrales Corrillero, Manager of the data repository of the Madroño Consortium.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. Can you briefly explain what the institutions you work for do?
Mireia Alcalá: The CSUC is the Consortium of University Services of Catalonia and is an organisation that aims to help universities and research centres located in Catalonia to improve their efficiency through collaborative projects. We are talking about some 12 universities and almost 50 research centres.
We offer services in many areas: scientific computing, e-government, repositories, cloud administration, etc., and we also offer library and open science services, which is the area closest to us. In the area of learning, research and open science, which is where I work, what we do is try to facilitate the adoption of new methodologies by the university and research system, especially in open science, and we support research data management.
Juan Corrales: The Consorcio Madroño is a consortium of university libraries of the Community of Madrid and the UNED (National University of Distance Education) for library cooperation. We seek to increase the scientific output of the universities that are part of the consortium and also to increase collaboration between the libraries in other areas. Like CSUC, we are also very involved in open science: in promoting it and in providing infrastructures that facilitate it, not only for the members of the Consorcio Madroño but also globally. Apart from that, we also provide other library services and create structures for them.
2. What are the requirements for an investigation to be considered open?
Juan Corrales: For research to be considered open there are many definitions, but perhaps one of the most important is given by the National Open Science Strategy, which has six pillars.
One of them is that it is necessary to make research data openly accessible, as well as publications, protocols and methodologies. In other words, everything must be accessible and, in principle, without barriers for everyone; not only for scientists, and not only for universities that can pay for access to these research data or publications. It is also important to use open source platforms that we can customise. Open source is software that anyone with the necessary knowledge can, in principle, modify, customise and redistribute, in contrast to the proprietary software of many companies, which does not allow these things. Another important point, although this is still far from being achieved in most institutions, is allowing open peer review, because it lets us know who has done a review, with what comments, etc. It can be said that it allows the peer review cycle to be redone and improved. A further point is citizen science: allowing ordinary citizens to be part of science, not only within universities or research institutes.
And another important point is adding new ways of measuring the quality of science.
Mireia Alcalá: I agree with what Juan says. I would also add that, for a research process to be considered open, we have to look at it globally, that is, include the entire data lifecycle. We cannot talk about open science if we only look at whether the data at the end is open. It is important, right from the beginning of the data lifecycle, to use platforms and to work in a more open and collaborative way.
3. Why is it important for universities and research centres to make their studies and data available to the public?
Mireia Alcalá: I think it is key that universities and centres share their studies, because a large part of research, both here in Spain and at European and world level, is funded with public money. Therefore, if society is paying for the research, it is only logical that it should also benefit from its results. In addition, opening up the research process can help make it more transparent, more accountable, etc. Much of the research done to date has been found to be neither reusable nor reproducible. What does this mean? That in almost 80% of cases, someone else cannot take the studies that have been done and reuse that data. Why? Because they do not follow the same standards, the same methods, and so on. So I think we have to extend this everywhere, and a clear example is in times of pandemics. With COVID-19, researchers from all over the world worked together, sharing data and findings in real time, working in the same way, and science was seen to be much faster and more efficient.
Juan Corrales: The key points have already been touched upon by Mireia. Besides, it could be added that bringing science closer to society can make all citizens feel that science is something that belongs to us, not just to scientists or academics. It is something we can participate in and this can also help to perhaps stop hoaxes, fake news, to have a more exhaustive vision of the news that reaches us through social networks and to be able to filter out what may be real and what may be false.
4. What research should be published openly?
Juan Corrales: Right now, according to the law we have in Spain, the latest Law of Science, all publications that are mainly financed by public funds or in which public institutions participate must be published in open access. This did not really have much repercussion until last year because, although the law came out two years ago, the previous law also said so, and there is also a law of the Community of Madrid that says the same thing. But since last year it is being taken into account in the evaluation that ANECA (the Quality Evaluation Agency) carries out on researchers. Since then, almost all researchers have made it a priority to publish their data and research openly. Above all for data, which is something that had not been done until now.
Mireia Alcalá: At the state level it is as Juan says. At the regional level we also have a law from 2022, the Law of Science, which basically says exactly the same as the Spanish law. But I also want people to know that we have to take into account not only the state legislation, but also the calls for proposals from which the money to fund the projects comes. Basically in Europe, in framework programmes such as Horizon Europe, it is clearly stated that, if you receive funding from the European Commission, you will have to make a data management plan at the beginning of your research and publish the data following the FAIR principles.
5. Among other issues, both CSUC and Consorcio Madroño are in charge of supporting entities and researchers who want to make their data available to the public. How should a process of opening research data be? What are the most common challenges and how do you solve them?
Mireia Alcalá: In our repository, which is called RDR (from Repositori de Dades de Recerca), it is basically the participating institutions that are in charge of supporting the research staff. The researcher arrives at the repository when he or she is already in the final phase of the research and needs to publish the data yesterday, and then everything is much more complex and time-consuming. It takes longer to verify this data and make it findable, accessible, interoperable and reusable.
In our particular case, we have a checklist that we require every dataset to comply with to ensure this minimum data quality, so that it can be reused. We are talking about having persistent identifiers such as ORCID for the researcher or ROR to identify the institutions, having documentation explaining how to reuse that data, having a licence, and so on. Because we have this checklist, researchers improve their processes as they deposit and start to work on and improve the quality of the data from the beginning. It is a slow process. The main challenge, I think, is getting researchers to realise that what they have is data, because most of them do not know it. Most researchers think of data as numbers from a machine that measures air quality, and are unaware that data can be a photograph, a film from an archaeological excavation, a sound captured in a certain atmosphere, and so on. Therefore, the main challenge is for everyone to understand what data is and that their data can be valuable to others.
And how do we solve it? Trying to do a lot of training, a lot of awareness raising. In recent years, the Consortium has worked to train data curation staff, who are dedicated to helping researchers directly refine this data. We are also starting to raise awareness directly with researchers so that they use the tools and understand this new paradigm of data management.
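As an illustration of the kind of minimum-quality checklist described above, the following sketch shows how some of those checks (persistent identifiers such as ORCID and ROR, a licence, reuse documentation) could be expressed in code. It is a hypothetical example written for this transcript: the field names and rules are assumptions, not CSUC's actual checklist.

```python
# Minimal, hypothetical sketch of an automated metadata check inspired by the
# checklist described above; field names and rules are illustrative assumptions.

def check_dataset_record(record: dict) -> list:
    """Return a list of problems found in a dataset metadata record."""
    problems = []
    if not record.get("author_orcid", "").startswith("https://orcid.org/"):
        problems.append("author lacks a persistent ORCID identifier")
    if not record.get("institution_ror", "").startswith("https://ror.org/"):
        problems.append("institution lacks a ROR identifier")
    if not record.get("license"):
        problems.append("no licence declared for reuse")
    if not record.get("readme"):
        problems.append("no documentation explaining how to reuse the data")
    return problems

example = {
    "title": "Air quality measurements 2020-2023",
    "author_orcid": "https://orcid.org/0000-0000-0000-0000",
    "institution_ror": "",   # missing: will be flagged
    "license": "CC-BY-4.0",
    "readme": "",            # missing: will be flagged
}

for issue in check_dataset_record(example):
    print("-", issue)
```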
Juan Corrales: In the Madroño Consortium, until November, the only way to open data was for researchers to hand over a form with the data and its metadata to the librarians, and it was the librarians who uploaded it to ensure that it was FAIR. Since November, we also allow researchers to upload data directly to the repository, but it is not published until it has been reviewed by expert librarians, who verify that the data and metadata are of high quality. It is very important that the data is well described so that it can be easily found, reused and identified.
As for the challenges, there are all those mentioned by Mireia - that researchers often do not know they have data - and also, although ANECA has helped a lot with the new obligations to publish research data, many researchers want to rush their data into the repositories without taking into account that it has to be quality data: it is not enough to deposit it, it is important that the data can be reused later.
6. What activities and tools do you or similar institutions provide to help organisations succeed in this task?
Juan Corrales: From Consorcio Madroño, the repository itself that we use, the tool where the research data is uploaded, makes it easy to make the data FAIR, because it already provides unique identifiers, fairly comprehensive metadata templates that can be customised, and so on. We also have another tool that helps create the data management plans for researchers, so that before they create their research data, they start planning how they're going to work with it. This is very important and has been promoted by European institutions for a long time, as well as by the Science Act and the National Open Science Strategy.
Then, beyond the tools, the review by expert librarians is also very important. There are other tools that help assess the quality of a dataset, of research data, such as FAIR EVA or F-UJI, but what we have found is that, in the end, what those tools are really evaluating is the quality of the repository, of the software being used, and of the requirements you impose when researchers upload their metadata, because all our datasets get a fairly high and quite similar evaluation. So what those tools do help us with is to improve both the requirements we are placing on our datasets and the tools we are using, in this case the Dataverse software.
Mireia Alcalá: At the level of tools and activities we are on a par, because we have had a relationship with the Madroño Consortium for years, and just like them we have all these tools that help and facilitate putting the data in the best possible way right from the start, for example, with the tool for making data management plans. Here at CSUC we have also been working very intensively in recent years to close this gap in the data life cycle, covering issues of infrastructure, storage, cloud, etc., so that, when the data is analysed and managed, researchers also have somewhere to go. After the repository, we move on to all the channels and portals that make it possible to disseminate and make all this science visible, because it makes no sense for us to build repositories that sit there in a silo; they have to be interconnected. For many years now, a lot of work has been done on interoperability protocols and on following the same standards. Therefore, data has to be available elsewhere, and both Consorcio Madroño and we are present in as many places as possible.
7. Can you tell us a bit more about these repositories you offer? In addition to helping researchers to make their data available to the public, you also offer a space, a digital repository where this data can be housed, so that it can be located by users.
Mireia Alcalá: If we are talking specifically about research data, as we and Consorcio Madroño have the same repository, we are going to let Juan explain the software and specifications, and I am going to focus on other repositories of scientific production that CSUC also offers. Here what we do is coordinate different cooperative repositories according to the type of resource they contain. So, we have TDX for theses, RECERCAT for research papers, RACO for scientific journals or MACO for open access monographs. Depending on the type of product, we have a specific repository, because not everything can be in the same place, as each output of the research has different particularities. Apart from the cooperative repositories, we also have other spaces that we build for specific institutions, either with a more standard solution or with more customised functionalities. But basically it is this: for each type of output produced in research, we have a specific repository adapted to the particularities of that format.
Juan Corrales: In the case of Consorcio Madroño, our repository is called e-scienceData, but it is based on the same software as the CSUC repository, which is Dataverse. It is open source software, so it can be improved and customised. Although in principle the development is managed from Harvard University in the United States, institutions from all over the world are participating in it - I do not know if thirty-odd countries have already contributed to its development.
Among other things, for example, the translations into Catalan have been done by CSUC, the translation into Spanish has been done by Consorcio Madroño, and we have also participated in other small developments. The advantage of this software is that it makes it much easier for the data to be FAIR and compatible with other portals that have much more visibility. Because, for example, CSUC is much larger, but in the Madroño Consortium there are six universities, and it is rare that someone goes to look for a dataset in the Madroño Consortium, in e-scienceData, directly. They usually search for it via Google or a European or international portal. With these facilities that Dataverse has, they can search from anywhere and end up finding the data that we have at Consorcio Madroño or at CSUC.
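The discoverability Juan describes can also be exploited programmatically: Dataverse installations expose a public search API. The sketch below assumes a generic installation; the base URL is a placeholder to be replaced by the address of the repository you actually want to query, and the query term is only an example.

```python
# Minimal sketch: querying the public search API of a Dataverse installation.
# BASE_URL is a placeholder, not the address of e-scienceData or RDR.
import requests

BASE_URL = "https://example-dataverse.org"

response = requests.get(
    f"{BASE_URL}/api/search",
    params={"q": "air quality", "type": "dataset", "per_page": 5},
    timeout=30,
)
response.raise_for_status()

for item in response.json()["data"]["items"]:
    # Each dataset item includes, among other fields, a name and a global identifier.
    print(item.get("name"), "-", item.get("global_id"))
```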
8. What other platforms with open research data, at Spanish or European level, do you recommend?
Juan Corrales: For example, at the Spanish level there is FECYT, the Spanish Foundation for Science and Technology, which has a harvester that collects the research outputs of practically all Spanish institutions. The publications of all the institutions appear there: Consorcio Madroño, CSUC and many more.
Then, specifically for research data, there is a lot of research that should be put in a thematic repository, because that is where researchers in that branch of science are going to look. We have a tool to help choose the thematic repository. At the European level there is Zenodo, which has a lot of visibility, but does not have the data quality support that CSUC or the Madroño Consortium provide. And that is something that is very noticeable in terms of reuse afterwards.
Mireia Alcalá: At the national level, apart from Consorcio Madroño's and our own initiatives, data repositories are not yet widespread. We are aware of some initiatives under development, but it is still too early to see their results. However, I do know of some universities that have adapted their institutional repositories so that they can also hold data. And while this is a valid solution for those who have no other choice, it has been found that repository software not designed to handle the particularities of data - such as heterogeneity, format diversity, large size, etc. - falls a bit short. Then, as Juan said, at the European level Zenodo is the established multidisciplinary and multi-format repository, which was born out of a European Commission project. I agree with him that, as it is a self-archiving and self-publishing repository - that is, I, Mireia Alcalá, can go there, spend five minutes putting up any document I have, nobody has looked at it, I fill in the minimum metadata they ask me for and I publish it - it is clear that the quality is very variable. There are some things that are really usable and perfect, but there are others that need a little more care. As Juan said, at the disciplinary level it is also important to highlight that, in all those areas that have a disciplinary repository, researchers should go there, because that is where they will be able to use the most appropriate metadata, where everybody will work in the same way, where everybody will know where to look for those data. For anyone who is interested, there is a directory called re3data, which is basically a directory of all these multidisciplinary and disciplinary repositories. It is therefore a good resource for anyone who does not know what exists in their discipline.
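Zenodo, mentioned by both guests, also offers a public REST API from which records can be retrieved. A minimal sketch follows; the search term and the fields printed are chosen only for illustration, and the exact parameters accepted by the API should be checked in Zenodo's documentation.

```python
# Minimal sketch: searching Zenodo's public records API.
import requests

response = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "archaeological excavation", "size": 5},
    timeout=30,
)
response.raise_for_status()

for hit in response.json()["hits"]["hits"]:
    metadata = hit.get("metadata", {})
    print(metadata.get("title"), "-", hit.get("doi", "no DOI"))
```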
9. What actions do you consider to be priorities for public institutions in order to promote open knowledge?
Mireia Alcalá: What I would basically say is that public institutions should focus on establishing clear policies on open science, because it is true that we have come a long way in recent years, but there are times when researchers are a bit bewildered. And apart from policies, it is above all about offering incentives to the entire research community, because there are many people making the effort to change their way of working to become immersed in open science, and sometimes they do not see how all that extra effort pays off. So I would say this: policies and incentives.
Juan Corrales: From my point of view, the policies that we already have at the national and regional level are usually quite correct, quite good. The problem is that often no real attempt has been made to enforce them. Until now, from what we have seen - for example with ANECA, which has promoted the use of data repositories and research article repositories - they had not really started to be used on a massive scale. In other words, incentives are necessary; it cannot be just a matter of obligation. As Mireia has also said, we have to convince researchers to see open publishing as their own, as something that benefits both them and society as a whole. What I think is most important is that: raising researchers' awareness.
Interview clips
1. Why should universities and researchers share their studies in open formats?
2. What requirements must an investigation meet in order to be considered open?
Did you know that data science skills are among the most in-demand skills in business? In this podcast, we are going to tell you how you can train yourself in this field, in a self-taught way. For this purpose, we will have two experts in data science:
- Juan Benavente, industrial and computer engineer with more than 12 years of experience in technological innovation and digital transformation. In addition, he has been training new professionals in technology schools, business schools and universities for years.
- Alejandro Alija, PhD in physics, data scientist and expert in digital transformation. In addition to his extensive professional experience focused on the Internet of Things, Alejandro also works as a lecturer in different business schools and universities.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. What is data science? Why is it important and what can it do for us?
Alejandro Alija: Data science could be defined as a discipline whose main objective is to understand the world, the processes of business and life, by analysing and observing data. In the last 20 years it has gained exceptional relevance due to the explosion in data generation, mainly caused by the irruption of the internet and the connected world.
Juan Benavente: The term data science has evolved since its inception. Today, a data scientist is the person who is working at the highest level in data analysis, often associated with the building of machine learning or artificial intelligence algorithms for specific companies or sectors, such as predicting or optimising manufacturing in a plant.
The profession is evolving rapidly, and is likely to fragment in the coming years. We have seen the emergence of new roles such as data engineers or MLOps specialists. The important thing is that today any professional, regardless of their field, needs to work with data. There is no doubt that any position or company requires increasingly advanced data analysis. It doesn't matter if you are in marketing, sales, operations or at university. Anyone today is working with, manipulating and analysing data. If we also aspire to data science, which would be the highest level of expertise, we will be in a very beneficial position. But I would definitely recommend any professional to keep this on their radar.
2. How did you get started in data science and what do you do to keep up to date? What strategies would you recommend for both beginners and more experienced profiles?
Alejandro Alija: My basic background is in physics, and I did my PhD in basic science. In fact, it could be said that any scientist, by definition, is a data scientist, because science is based on formulating hypotheses and proving them with experiments and theories. My relationship with data started early in academia. A turning point in my career was when I started working in the private sector, specifically in an environmental management company that measures and monitors air pollution. The environment is a field that is traditionally a major generator of data, especially as it is a regulated sector where administrations and private companies are obliged, for example, to record air pollution levels under certain conditions. I found historical series up to 20 years old that were available for me to analyse. From there my curiosity began and I specialised in concrete tools to analyse and understand what is happening in the world.
Juan Benavente: I can identify with what Alejandro said because I am not a computer scientist either. I trained in industrial engineering and, although computer science is one of my interests, it was not my base. By contrast, nowadays I do see that more specialists are being trained at university level. A data scientist today carries many skills, such as statistics, mathematics and the ability to understand everything that goes on in the industry. I have been acquiring this knowledge through practice. On how to keep up to date, I think that, in many cases, you can stay in contact with companies that are innovating in this field. A lot can also be learned at industry or technology events. I started in smart cities and have moved on to the industrial world, learning little by little.
Alejandro Alija: To add another source for keeping up to date, apart from what Juan has said, I think it is important to identify what we might call the outsiders: the manufacturers of technologies, the market players. They are a very useful source of information to stay up to date: identify their future strategies and what they are betting on.
3. If someone with little or no technical knowledge wants to learn data science, where do they start?
Juan Benavente: In training, I have come across very different profiles: from people who have just graduated from university to people trained in very different fields who find in data science an opportunity to transform themselves and dedicate themselves to this. Thinking of someone who is just starting out, I think the best thing to do is put your knowledge into practice. In projects I have worked on, we defined the methodology in three phases: a first phase covering the more theoretical aspects, taking into account mathematics, programming and everything a data scientist needs to know; then, once you have those basics, the sooner you start working and practising those skills, the better. I believe that practice sharpens the wit and, both to keep up to date and to train yourself and acquire useful knowledge, the sooner you get into a project, the better. And even more so in a world that is updated so frequently. In recent years, the emergence of generative AI has brought other opportunities. There are also opportunities for new profiles who want to be trained. Even if you are not an expert in programming, you have tools that can help you with programming, and the same can happen in mathematics or statistics.
Alejandro Alija: To complement what Juan says from a different perspective, I think it is worth highlighting the evolution of the data science profession. I remember when that paper about "the sexiest profession in the world" became famous and went viral, but then things adjusted. The first settlers in the world of data science did not come so much from computer science or informatics. There were more outsiders: physicists and mathematicians, with a strong background in mathematics and physics, and even some engineers whose work and professional development meant that they ended up using many tools from the computer science field. Gradually, it has become more and more balanced. It is now a discipline that continues to have those two strands: people who come from the world of physics and mathematics towards the data itself, and people who come with programming skills. Everyone knows what they have to balance in their toolbox. Thinking about a junior profile who is just starting out, I think a very important thing - and we see this when we teach - is programming skills. I would say that having programming skills is not just a plus, but a basic requirement for advancement in this profession. It is true that some people can do well without a lot of programming skills, but I would argue that a beginner needs those first programming skills with a basic toolset. We are talking about languages such as Python and R, which are the headline languages. You do not need to be a great coder, but you do need some basic knowledge to get started. Then, of course, specific training in the mathematical foundations of data science is crucial. Fundamental statistics and more advanced statistics are complements that, if present, will move a person along the data science learning curve much faster. Thirdly, I would say that specialisation in particular tools is important. Some people are more oriented towards data engineering, others towards the modelling world. Ideally, specialise in a few frameworks and use them together, as optimally as possible.
4. In addition to teaching, you both work in technology companies. What technical certifications are most valued in the business sector and what open sources of knowledge do you recommend to prepare for them?
Juan Benavente: Personally, it is not what I look at most, but I think it can be relevant, especially for people who are starting out and need help in structuring their approach to the problem and understanding it. I recommend certifications in technologies that are in use in any company where you might want to end up working, especially from providers of cloud computing and widespread data analytics tools. These are certifications that I would recommend for someone who wants to approach this world and needs a structure to help them. When you do not have a knowledge base, it can be a bit confusing to understand where to start. Perhaps you should reinforce programming or mathematical knowledge first, but it can all seem a bit complicated. Where these certifications certainly help you is, in addition to reinforcing concepts, in ensuring that you are progressing well and know the typical ecosystem of tools you will be working with tomorrow. It is not just about theoretical concepts, but about knowing the ecosystems that you will encounter when you start working, whether you are starting your own company or working in an established one. It makes it much easier for you to get to know the typical ecosystem of tools, whether from Microsoft, Amazon or other providers of such solutions. This will allow you to focus more quickly on the work itself, and less on all the tools that surround it. I believe that this type of certification is useful, especially for profiles approaching this world with enthusiasm. It will help them both to structure themselves and to land well in their professional destination. They are also likely to be valued in selection processes.
Alejandro Alija: If someone listens to us and wants more specific guidelines, it could be structured in blocks. There are a series of massive online courses that, for me, were a turning point. In my early days, I tried to enrol in several of these courses on platforms such as Coursera, edX, where even the technology manufacturers themselves design these courses. I believe that this kind of massive, self-service, online courses provide a good starting base. A second block would be the courses and certifications of the big technology providers, such as Microsoft, Amazon Web Services, Google and other platforms that are benchmarks in the world of data. These companies have the advantage that their learning paths are very well structured, which facilitates professional growth within their own ecosystems. Certifications from different suppliers can be combined. For a person who wants to go into this field, the path ranges from the simplest to the most advanced certifications, such as being a data solutions architect or a specialist in a specific data analytics service or product. These two learning blocks are available on the internet, most of them are open and free or close to free. Beyond knowledge, what is valued is certification, especially in companies looking for these professional profiles.
5. In addition to theoretical training, practice is key, and one of the most interesting methods of learning is to replicate exercises step by step. In this sense, from datos.gob.es we offer didactic resources, many of them developed by you as experts in the project. Can you tell us what these exercises consist of? How are they approached?
Alejandro Alija: The approach we always took was designed for a broad audience, without complex prerequisites. We wanted any user of the portal to be able to replicate the exercises, although it is clear that the more knowledge you have, the more you can take advantage of them. The exercises have a well-defined structure: a documentary section, usually a content post or a report describing what the exercise consists of, what materials are needed, what the objectives are and what it is intended to achieve. In addition, we accompany each exercise with two additional resources. The first is a code repository where we upload the necessary materials, with a brief description and the code of the exercise. It can be a Python notebook, a Jupyter notebook or a simple script, where the technical content is. And then there is another fundamental element, aimed at making it easier to run the exercises. In data science and programming, non-specialist users often find it difficult to set up a working environment. A Python exercise, for example, requires having a programming environment installed, knowing the necessary libraries and making configurations that are trivial for professionals but can be very complex for beginners. To mitigate this barrier, we publish most of our exercises on Google Colab, a wonderful and open tool. Google Colab is a web programming environment where the user only needs a browser to access it. Basically, Google provides us with a virtual computer where we can run our programmes and exercises without the need for special configurations. The important thing is that the exercise is ready to use, and we always check it in this environment, which makes it much easier for beginners or less technically experienced users to learn.
Juan Benavente: Yes, we always take a user-oriented approach, step by step, trying to make it open and accessible. The aim is for anyone to be able to run an exercise without the need for complex configurations, focusing on topics as close to reality as possible. We often take advantage of open data published by entities such as the DGT or other bodies to carry out realistic analyses. We have developed very interesting exercises, such as energy market predictions or analysis of critical materials for batteries and electronics, which allow you to learn not only about the technology but also about the specific subject matter. You can get down to work right away, not only to learn, but also to find out about the subject.
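As an illustration of what the first cells of one of these exercises typically look like in Google Colab, the fragment below loads an open CSV file with pandas and prints a quick summary. The URL is a placeholder, not a real dataset: each published exercise links to the specific open data (DGT, energy market, etc.) that it analyses.

```python
# Minimal sketch of the opening cells of a step-by-step data exercise in Colab.
# DATA_URL is a placeholder; real exercises point to a specific open dataset.
import pandas as pd

DATA_URL = "https://example.org/open-data/traffic.csv"

df = pd.read_csv(DATA_URL)           # load the open dataset
print(df.shape)                      # number of rows and columns
print(df.head())                     # first records
print(df.describe(include="all"))    # quick statistical summary
```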
6. In closing, we'd like you to offer a piece of advice that is more attitude-oriented than technical, what would you say to someone starting out in data science?
Alejandro Alija: As for an attitude tip for someone starting out in data science, I suggest being brave. There is no need to worry about being unprepared, because in this field everything is still to be done and anyone can contribute value. Data science is multi-faceted: there are professionals closer to the business world who can provide valuable insights, and others who are more technical and need to understand the context of each area. My advice is to be content with the resources available without panicking, because, although the path may seem complex, the opportunities are very high. As a technical tip, it is important to be sensitive to the development and use of data. The more understanding one has of this world, the smoother the approach to projects will be.
Juan Benavente: I endorse the advice to be brave and add a reflection on programming: many people find the theoretical concept attractive, but when they get to practice and see the complexity of programming, some are discouraged by lack of prior knowledge or different expectations. It is important to add the concepts of patience and perseverance. When you start in this field, you are faced with multiple areas that you need to master: programming, statistics, mathematics, and specific knowledge of the sector you will be working in, be it marketing, logistics or another field. The expectation of becoming an expert quickly is unrealistic. It is a profession that, although it can be started without fear and by collaborating with professionals, requires a journey and a learning process. You have to be consistent and patient, managing expectations appropriately. Most people who have been in this world for a long time agree that they have no regrets about going into data science. It is a very attractive profession where you can add significant value, with an important technological component. However, the path is not always straightforward. There will be complex projects, moments of frustration when analyses do not yield the expected results or when working with data proves more challenging than expected. But looking back, few professionals regret having invested time and effort in training and developing in this field. In summary, the key tips are: courage to start, perseverance in learning and development of programming skills.
Interview clips
1. Is it worth studying data science?
2. How are the data science exercises on datos.gob.es approached?
3. What is data science? What skills are required?
In this episode we will discuss artificial intelligence and its challenges, based on the European Regulation on Artificial Intelligence that entered into force this year. Come and find out about the challenges, opportunities and new developments in the sector from two experts in the field:
- Ricard Martínez, professor of constitutional law at the Universitat de València, where he directs the Microsoft-Universitat de València Chair of Privacy and Digital Transformation.
- Carmen Torrijos, computational linguist, expert in AI applied to language and professor of text mining at the Carlos III University.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. It is a fact that artificial intelligence is constantly evolving. To get into the subject, could you tell us about the latest developments in AI?
Carmen Torrijos: Many new applications are emerging. For example, this past weekend there has been a lot of buzz about an AI for image generation in X (Twitter), I don't know if you've been following it, called Grok. It has had quite an impact, not because it brings anything new, as image generation is something we have been doing since December 2023, but because it is an AI with less censorship. Until now we had a lot of difficulty getting the generalist systems to make images with the faces of celebrities or depicting certain situations, and this was closely controlled in every tool. What Grok does is lift all those restrictions so that anyone can make any kind of image with any famous person or any well-known face. It is probably a passing fad. We will make images for a while and then it will pass.
And then there are also automatic podcast creation systems, such as Notebook LM. We have been watching them for a couple of months now, and it has been one of the things that has surprised me most recently. Because it already seems that all innovations are incremental: on top of what we already have, they give us something better. But this is something really new and surprising. You upload a PDF and it can generate a podcast of two people talking in a totally natural, totally realistic way about that PDF. This is something that Notebook LM, which is owned by Google, can do.
2. The European Regulation on Artificial Intelligence is the world's first legal regulation on AI, with what objectives is this document, which is already a reference framework at international level, being published?
Ricard Martínez: The regulation arises from something that is implicit in what Carmen has told us. All this that Carmen describes happens because we have opened ourselves up to the same unbridled race that we experienced with the emergence of social media. When this happens, it is not innocent: it is not that companies are being generous, it is that companies are competing for our data. They gamify us, they encourage us to play, they encourage us to provide them with information, and that is why they open up. They do not open up because they are generous, or because they want to work for the common good or for humanity. They open up because we are doing their work for them. What does the EU want to stop? What we learned from social media. The European Union has two main approaches, which I will try to explain very succinctly. The first is a systemic risk approach. The European Union has said: "I will not tolerate artificial intelligence tools that may endanger the democratic system, i.e. the rule of law and the way I operate, or that may seriously infringe fundamental rights". That is a red line.
The second approach is product-oriented. An AI is a product. When you make a car, you follow rules that govern how you produce that car, and that car comes to market when it is safe, when it has all the specifications. This is the second major focus of the Regulation. The regulation says: you can be developing a technology because you are doing research, and I almost let you do whatever you want. Now, if this technology is to come to market, you will classify its risk. If the risk is low or slight, you are going to be able to do a lot of things and, practically speaking, with transparency and codes of conduct, I will give you a pass. But if it is a high risk, you are going to have to follow a standardised design process, and you are going to need a notified body to verify that technology, make sure that in your documentation you have met what you have to meet, and then they will give you a CE mark. And that is not the end of it, because there will be post-market surveillance. So, throughout the life cycle of the product, you need to ensure that it works well and conforms to the standard.
On the other hand, tight control is established with regard to large data models - not only LLMs, but also image models and models of other types of information - where it is believed that they may pose systemic risks.
In that case, there is a very direct control by the Commission. So, in essence, what they are saying is: "respect rights, guarantee democracy, produce technology in an orderly manner according to certain specifications".
Carmen Torrijos: Yes, in terms of objectives it is clear. I have taken up Ricard's last point about producing technology in accordance with this Regulation. We have this mantra that the US does things, Europe regulates things and China copies things. I don't like to generalise like that. But it is true that Europe is a pioneer in terms of legislation and we would be much stronger if we could produce technology in line with the regulatory standards we are setting. Today we still can't, maybe it's a question of giving ourselves time, but I think that is the key to technological sovereignty in Europe.
3. In order to produce such technology, AI systems need data to train their models. What criteria should the data meet in order to train an AI system correctly? Could open data sets be a source? In what way?
Carmen Torrijos: The data we feed AI with is the point of greatest conflict. Can we train with any dataset even if it is available? We are not talking about open data, but about available data.
One open dataset, for example, is the basis of all language models, and everyone knows it: Wikipedia. Wikipedia is an ideal example for training, because it is open, it is optimised for computational use, it is downloadable, it is very easy to use, it contains a lot of language - for example, for training language models - and a lot of knowledge of the world. This makes it the ideal dataset for training an AI model. And Wikipedia is in the open, it is available, it belongs to everyone and it is for everyone; you can use it.
But can all the datasets available on the Internet be used to train AI systems? That is somewhat in doubt. Because the fact that something is published on the Internet does not mean that it is public, for public use, even though you can take it, train a system and start generating profit from that system. That content had copyright, authorship and intellectual property. That, I think, is the most serious conflict we have right now in generative AI, because it uses content to inspire and create. And there, little by little, Europe is taking small steps. For example, the Ministry of Culture has launched an initiative to start looking at how we can create content - licensed datasets - to train AI in a way that is legal, ethical and respectful of authors' intellectual property rights.
All this is generating a lot of friction. Because if we go on like this, we will turn many illustrators, translators, writers, etc. (all creators who work with content) against us, because they will not want this technology to be developed at the expense of their content. Somehow a balance has to be found between regulation and innovation so that both can happen. From the large technological systems being developed, especially in the United States, there is a repeated idea that with licensed content alone - with legal datasets that are free of intellectual property, or for which the necessary returns have been paid - it is not possible to reach the level of quality of the AIs we have now. That is, with legal datasets alone we would not have ChatGPT at the level it is at now.
This is not set in stone and does not have to be the case. We have to continue researching; that is, we have to continue to see how we can achieve a technology of that level, but one that complies with the regulation. Because what they have done in the United States - what GPT-4, the great language models and the great image generation models have done - is to show us the way: this is how far we can go. But they have done so by taking content that is not theirs, that it was not permissible to take. We have to get back to that level of quality, back to that level of performance of the models, while respecting the intellectual property of the content. And that is a role that I believe is primarily Europe's responsibility.
4. Another issue of public concern with regard to the rapid development of AI is the processing of personal data. How should they be protected and what conditions does the European regulation set for this?
Ricard Martínez: There is a set of conducts that have been prohibited essentially to guarantee the fundamental rights of individuals. But it is not the only measure. I attach a great deal of importance to an article in the regulation that we are probably not going to give much thought to, but for me it is key. There is an article, the fourth one, entitled AI Literacy, which says that any subject that is intervening in the value chain must have been adequately trained. You have to know what this is about, you have to know what the state of the art is, you have to know what the implications are of the technology you are going to develop or deploy. I attach great value to it because it means incorporating throughout the value chain (developer, marketer, importer, company deploying a model for use, etc.) a set of values that entail what is called accountability, proactive responsibility, by default. This can be translated into a very simple element, which has been talked about for two thousand years in the world of law, which is 'do no harm', the principle of non-maleficence.
With something as simple as that, "do no harm to others, act in good faith and guarantee your rights", there should be no perverse effects or harmful effects, which does not mean that it cannot happen. And this is precisely what the Regulation says in particular when it refers to high-risk systems, but it is applicable to all systems. The Regulation tells you that you have to ensure compliance processes and safeguards throughout the life cycle of the system. That is why it is so important to have robustness, resilience and contingency plans that allow you to revert, shut down, switch to human control, change the usage model when an incident occurs.
Therefore, the whole ecosystem is geared towards this objective of doing no harm and infringing no rights. And there is an element that no longer depends on us; it depends on public policy. AI will not only have an impact on rights, it will change the way we understand the world. If there are no public policies in the education sector that ensure that our children develop computational thinking skills and are able to interact with a machine interface, their access to the labour market will be significantly affected. The same applies if we do not ensure the continuous training of active workers, and public policies for those sectors that are doomed to disappear.
Carmen Torrijos: I find Ricard's approach - that to train is to protect - very interesting. Train people, inform people, get people trained in AI: not only those in the value chain, but everybody. The more you train and empower people, the more you are protecting them.
When the law came out, there was some disappointment in AI environments and especially in creative environments. Because we were in the midst of the generative AI boom and generative AI was hardly being regulated, but other things were being regulated that we took for granted would not happen in Europe, but that have to be regulated so that they cannot happen. For example, biometric surveillance: Amazon can't read your face to decide whether you are sadder that day and sell you more stuff or get more advertising or a particular advertisement. I say Amazon, but it can be any platform. This, for example, will not be possible in Europe because it is forbidden by law, it is an unacceptable use: biometric surveillance.
Another example is social scoring, which we see happening in China, where citizens are given points and access to public services based on those points. That is not going to be possible either. And this part of the law must also be considered, because we take it for granted that this is not going to happen to us, but it is when you do not regulate it that it happens. China has installed 600 million facial recognition cameras, which recognise you with your ID card. That is not going to happen in Europe because it cannot, because it is also biometric surveillance. So you have to understand that the law perhaps seems to be slowing down on what we are now enraptured by, which is generative AI, but it has been dedicated to addressing very important points that needed to be covered in order to protect people, in order not to lose fundamental rights that we have already won.
Finally, ethics has a very uncomfortable component, which nobody wants to look at, which is that sometimes something has to be revoked. Sometimes it is necessary to remove something that is in operation, even something that is providing a benefit, because it is incurring some kind of discrimination, or because it is bringing some kind of negative consequence that violates the rights of a collective, of a minority or of someone vulnerable. And that is very complicated. When we have become accustomed to having an AI operating in a certain context, which may even be a public context, stopping and saying that this system is discriminating against people, that it cannot continue in production and has to be withdrawn, is very hard. This point is very complicated and very uncomfortable, and when we talk about ethics - which we do very easily - we must also think about how many systems we are going to have to stop and review before we can put them back into operation, however easy they make our lives or however innovative they may seem.
5. In this sense, taking into account all that the Regulation contains, some Spanish companies, for example, will have to adapt to this new framework. What should organisations already be doing to prepare? What should Spanish companies review in the light of the European regulation?
Ricard Martínez: This is very important, because there is a corporate level of companies with high capabilities that I am not worried about, because these companies understand that we are talking about an investment, just as they invested in a process-based model that integrated compliance by design for data protection. The next leap, which is to do exactly the same thing with artificial intelligence, I won't say is unimportant, because it is of relevant importance, but let's say it follows a path that has already been tried. These companies already have compliance units, they already have advisors, and they already have routines into which the artificial intelligence regulatory framework can be integrated as part of the process. In the end, what it will do is to increase risk analysis in one sense. It will surely force the design processes, and also the design phases themselves, to be modular; that is, while in software design we are practically talking about going from a non-functional model to writing code, here there are a series of tasks of enrichment, annotation, validation of the datasets and prototyping that surely require more effort, but they are routines that can be standardised.
My experience in European projects where we have worked with clients, i.e. SMEs, who expect AI to be plug and play, is that we have seen a huge lack of capacity building. The first question you should ask yourself is not whether your company needs AI, but whether your company is ready for AI. That is an earlier and rather more relevant question. You think you can make a leap into AI, that you can contract a certain type of service, and we are realising that you do not even comply with the data protection regulation.
There is an entity called the Spanish Agency for Artificial Intelligence, AESIA, and there is a Ministry of Digital Transformation, and if there are no accompanying public policies, we may run into risky situations. Why? Because I have the great pleasure of training future entrepreneurs in artificial intelligence in undergraduate and postgraduate courses. When confronted with the ethical and legal framework, I won't say they want to die, but the world comes crashing down on them. Because there is no support, there is no accompaniment, there are no resources - or they cannot see them - that do not involve a round of investment they cannot bear, and there are no guided models that help them in a way that is, I won't say easy, but at least usable.
Therefore, I believe that there is a substantial challenge in public policies, because if this combination does not happen, the only companies that will be able to compete are those that already have a critical mass, an investment capacity and an accumulated capital that allows them to comply with the standard. This situation could lead to a counterproductive outcome.
We want to regain European digital sovereignty, but if there are no public investment policies, the only ones who will be able to comply with the European standard are companies from other countries.
Carmen Torrijos: Not because they are from other countries but because they are bigger.
Ricard Martínez: Yes, not to mention countries.
6. We have talked about challenges, but it is also important to highlight opportunities. What positive aspects could you highlight as a result of this recent regulation?
Ricard Martínez: I am working, with European funding, on the construction of Cancer Image EU, which is intended to be a digital infrastructure for cancer imaging. At the moment, we are talking about a partnership involving 14 countries and 76 organisations, on the way to 93, to generate a medical imaging database of 25 million cancer images with associated clinical information for the development of artificial intelligence. The infrastructure is still being built, it does not yet exist, and even so, at the Hospital La Fe in Valencia, research is already underway with mammograms of women who underwent biennial screening and later developed cancer, to see if it is possible to train an image analysis model capable of preventively recognising that little spot that the oncologist or radiologist did not see and that later turned out to be a cancer. Does it mean you get chemotherapy five minutes later? No. It means they are going to monitor you, they are going to have an early reaction capability. And the health system will save 200,000 euros. To mention just one opportunity.
On the other hand, opportunities must also be sought in other rules, not only in the Artificial Intelligence Regulation. You have to look at the Data Governance Act, which seeks to counter the data monopoly held by US companies with a sharing of data from the public sector, the private sector and the citizenry itself; at the Data Act, which aims to empower citizens to retrieve their data and share it by consent; and finally at the European Health Data Space, which aims to create a health data ecosystem to promote innovation, research and entrepreneurship. It is this ecosystem of data spaces that should be a huge generator of opportunity spaces.
And furthermore, I don't know whether they will succeed or not, but it aims to be coherent with our business ecosystem. That is to say, an ecosystem of small and medium-sized enterprises that does not have high data generation capabilities and what we are going to do is to build the field for them. We are going to create the data spaces for them, we are going to create the intermediaries, the intermediation services, and we hope that this ecosystem as a whole will allow European talent to emerge from small and medium-sized enterprises. Will it be achieved or not? I don't know, but the opportunity scenario looks very interesting.
Carmen Torrijos: If you ask for opportunities, all opportunities. Not only artificial intelligence, but all technological progress, is such a huge field that it can bring opportunities of all kinds. What needs to be done is to lower the barriers, which is the problem we have. And we also have barriers of many kinds, because we have technical barriers, talent barriers, salary barriers, disciplinary barriers, gender barriers, generational barriers, and so on.
We need to focus energies on lowering those barriers, and then I also think we still come from the analogue world and have little global awareness that both digital and everything that affects AI and data is a global phenomenon. There is no point in keeping it all local, or national, or even European, but it is a global phenomenon. The big problems we have come because we have technology companies that are developed in the United States working in Europe with European citizens' data. A lot of friction is generated there. Anything that can lead to something more global will always be in favour of innovation and will always be in favour of technology. The first thing is to lift the barriers within Europe. That is a very positive part of the law.
7. At this point, we would like to take a look at the state we are in and the prospects for the future. How do you see the future of artificial intelligence in Europe?
Ricard Martínez: I have two visions: one positive and one negative. And both come from my experience in data protection. If now that we have a regulatory framework, the regulatory authorities, I am referring to artificial intelligence and data protection, are not capable of finding functional and grounded solutions, and they generate public policies from the top down and from an excellence that does not correspond to the capacities and possibilities of research - I am referring not only to business research, but also to university research - I see a very dark future. If, on the other hand, we understand regulation in a dynamic way with supportive and accompanying public policies that generate the capacities for this excellence, I see a promising future because in principle what we will do is compete in the market with the same solutions as others, but responsive: safe, responsible and reliable.
Carmen Torrijos: Yes, I very much agree. I introduce the time variable into that, don't I? Because I think we have to be very careful not to create more inequality than we already have. More inequality among companies, more inequality among citizens. If we are careful with this, which is easy to say but difficult to do, I believe that the future can be bright, but it will not be bright immediately. In other words, we are going to have to go through a darker period of adapting to change. Just as many issues of digitalisation are no longer alien to us, have already been worked on, we have already gone through them and have already regulated them, artificial intelligence also needs its time.
We have had very few years of AI, and very few years of generative AI. In fact, two years is nothing in a worldwide technological change. And we have to give time to laws, and we also have to give time for things to happen. To give a very obvious example: the New York Times' complaint against Microsoft and OpenAI has not yet been resolved. It has been a year - it was filed in December 2023 - since the New York Times complained that they had trained AI systems with its content, and in a year nothing has been settled in that process. Court proceedings are very slow. We need more to happen, and more processes of this type to be resolved, in order to have precedents and to mature as a society in relation to what is happening; we still have a long way to go. It is as if almost nothing has happened. So the time variable, I think, is important, and although at the beginning we may have a darker future, as Ricard says, I think that in the long term, if we keep clear limits, we can reach something brilliant.
Interview clips
1. What criteria should data meet in order to train an AI system?
2. What should Spanish companies review in light of the AI Regulation?
This episode focuses on data governance and why it is important to have standards, policies and processes in place to ensure that data is correct, reliable, secure and useful. To this end, we analyse the Model Ordinance on Data Governance of the Spanish Federation of Municipalities and Provinces (FEMP) and its application in a public body such as Zaragoza City Council. We do so with the following guests:
- Roberto Magro Pedroviejo, coordinator of the Open Data Working Group of the Network of Local Entities for Transparency and Citizen Participation of the Spanish Federation of Municipalities and Provinces and civil servant of the Alcobendas City Council.
- María Jesús Fernández Ruiz, head of the Technical Office of Transparency and Open Government of Zaragoza City Council.
Listen to the full podcast (only available in Spanish)
Summary / Transcript of the interview
1. What is data governance?
Roberto Magro Pedroviejo: In the field of public administration, we define data governance as an organisational and technical mechanism that comprehensively addresses the use of data in our organisations. It covers the entire data lifecycle, from creation to archiving or even, if necessary, purging and destruction. Its purpose is to ensure that data is of high quality and available to everyone who needs it: sometimes that will be only the organisation itself internally, but very often it will be the general public, re-users, the university community, etc. Data governance must also facilitate the right of access to data. In short, it makes it possible to manage our administration effectively and efficiently and to achieve greater interoperability between all administrations.
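As a rough illustration of that lifecycle (the stage names below are an assumption made for the example, not terminology fixed by the FEMP ordinance), the governance questions can be attached to each stage from creation through to destruction:

```python
from enum import Enum, auto

class DataLifecycleStage(Enum):
    """Assumed stages, paraphrasing the lifecycle described above."""
    CREATION = auto()
    MANAGEMENT = auto()    # storage, quality control, documentation
    PROVISION = auto()     # internal sharing or open data publication
    ARCHIVING = auto()
    PURGE = auto()         # destruction, when retention rules require it

# Governance questions a policy should answer at each stage
governance_checklist = {
    DataLifecycleStage.CREATION: "Who creates the data and is its sovereign?",
    DataLifecycleStage.MANAGEMENT: "Are quality, metadata and traceability maintained?",
    DataLifecycleStage.PROVISION: "Who may access it: the organisation, re-users, the public?",
    DataLifecycleStage.ARCHIVING: "Where are historical data and their documentation kept?",
    DataLifecycleStage.PURGE: "When is the data destroyed and who authorises it?",
}

for stage, question in governance_checklist.items():
    print(f"{stage.name}: {question}")
```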
2. Why is this concept important for a municipality?
María Jesús Fernández Ruiz: Because we have found that, within organisations, both public and private, data is often collected and managed without following homogeneous criteria, standards or appropriate techniques. This leads to a difficult and costly situation, which becomes even more apparent when we try to build a data space or develop data-related services. We therefore need an umbrella that obliges us to manage data, as Roberto has said, effectively and efficiently, following homogeneous standards and criteria, which facilitates interoperability.
3. To meet this challenge, it is necessary to establish a set of guidelines to help local administrations set up a legal framework. For this reason, the FEMP Model Ordinance on Data Governance has been created. What was the process of developing this reference document like?
Roberto Magro Pedroviejo: Within the Network's Open Data Group, created back in 2017, one of the people we have counted on and who has contributed many ideas is María Jesús, from Zaragoza City Council. We were just coming out of COVID, in March 2021, and I remember perfectly the meeting we held in a room lent to us by Madrid City Council in the Cibeles Palace. María Jesús was in Zaragoza and joined by videoconference. That day, while looking at what work this multidisciplinary group could take on, María Jesús proposed creating a model ordinance. The FEMP and the Network already had experience in producing model ordinances to improve, and above all to help, municipalities, local entities and councils in drawing up their own regulations.
We started working as a multidisciplinary team, led by José Félix Muñoz Soro, from the University of Zaragoza, who coordinated the regulatory text we have published. A few months later, in January 2022 to be precise, we met in person at Zaragoza City Council and began to establish the basis for the model ordinance: what type of articles it should have, what structure it should follow, and so on. We put together a multidisciplinary team which included experts in data governance and jurists from the University of Zaragoza, staff from the Polytechnic University of Madrid, colleagues from the Polytechnic University of Valencia, professionals from the local public sphere and journalists specialising in open data.
The first draft was published in May/June 2022 and was made available for public consultation through Zaragoza City Council's citizen participation platform. We contacted around 100 national experts and received around 30 suggested improvements, most of which were incorporated, allowing us to have the final text by the end of last year; it was then passed to the FEMP's legal department for validation. The ordinance was published in February 2024 and is now available on the Network's website for free download.
I would like to take this opportunity to thank the excellent work done by all the people involved in the team who, from their respective points of view, have worked selflessly to create this knowledge and share it with all the Spanish public administrations.
4. What are the expected benefits of the ordinance?
María Jesús Fernández Ruiz: For me, one of the main strengths of the ordinance, and I think it is a great instrument, is that it addresses the whole life cycle of the data: from the moment the data is generated, through how it is managed and provided, to how the associated documentation and historical data must be stored, etc. The most important thing is that it establishes criteria for managing data across its entire life cycle.
The ordinance also establishes a small number of principles which are very important and set the tone. They speak, for example, of effective data governance and describe the importance of establishing processes for generating, managing and providing the data.
Another very important principle, already mentioned by Roberto, is the ethical treatment of data: the importance of recording data traceability, of knowing where the data moves, and of respecting the rights of natural and legal persons.
Another very important principle, one that generates a lot of debate within institutions, is that data must be managed by design and by default. Often, when we start working on data with openness criteria, we are already in the middle or near the end of the data lifecycle. We have to design data management from the beginning, at the source. This saves a great deal of resources, both human and financial.
Another important issue for us, and one that we advocate within the ordinance, is that the administration has to be data-driven: an administration that designs its policies based on evidence, treats data as a strategic asset and therefore provides the necessary resources.
And another issue, which Roberto and I often discuss, is the importance of data culture. When we work on and publish data that is interoperable, easy to reuse and well understood, we cannot stop there: we must also foster a data culture, which is likewise included in the ordinance. It is important that we explain what data is, what quality data is, how to access it and how to use it. In other words, every time we publish a dataset, we should consider accompanying actions related to data culture.
5. Zaragoza City Council has been a pioneer in the application of this ordinance. What has this implementation process been like and what challenges are you facing?
María Jesús Fernández Ruiz: This challenge has been very interesting and has also helped us to improve. It moved very quickly at the beginning and by June we were ready to present the ordinance to the city government. There is a stage in which the different political groups submit individual observations (votos particulares) on the ordinance, saying "I like this point", "this one seems more interesting", "this one should be modified", and so on. To our surprise, we received more than 50 such observations, even after having gone through the public consultation process and having appeared in all the media, which was also enriching, and we have had to respond to each of them. The truth is that it has helped us to improve and, at the moment, we are waiting for the ordinance to go before the government.
When people ask me "how do you feel, María Jesús?", the answer is: well, we are making progress. Thanks to this ordinance, which is still pending approval by the Zaragoza City Council government, we have already issued a series of contracts. One is extremely important for us: drawing up an inventory of the data and information sources in our institution, which we believe is the basic instrument for managing data, knowing what data we have, where it originates, what traceability it has, etc. So we have not stopped. Thanks to this framework, even though it has not yet been approved, we have been able to make progress through contracts and on something that is fundamental in an institution: defining the professionals who have to participate in data management.
6. You mentioned the need to develop an inventory of datasets and information sources, what kind of datasets are we talking about and what descriptive information should be included for each?
Roberto Magro Pedroviejo: There is a central core of datasets that we recommend in the ordinance itself, drawing on other work done by the open data group, which recommends 80 datasets that Spanish public administrations could publish. The focus is also on high-value datasets: those that can most benefit municipal management or provide social and economic value to the general public, the business community and re-users. Any administration that wants to start working on datasets and wonders where to begin publishing or managing data should focus, in my view, on three key areas in a city:
- Personal data, i.e. our beloved census: who the people living in our city are, their ages, gender, postal addresses, etc.
- Urban and territorial data, that is, where these people live, what the territorial boundaries of the municipality are, etc. Everything related to streets, roads, even sewerage, public roads or lighting needs to be inventoried, so that we know where these data are and keep them, as we have already said, updated, structured, accessible, etc.
- And finally, everything that has to do with how the city is managed, of course, with the tax and budget area.
That is: the personal sphere, the territorial sphere and the tax sphere. That is where we recommend starting. In the end, this inventory of datasets describes what they are, where they are and what state they are in, and it will be the first foundation on which to build data governance. A minimal sketch of what one inventory entry might look like follows.
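As a purely illustrative sketch (the field names and example values below are assumptions inspired by common open data metadata practice, not a schema defined in the FEMP ordinance), a single entry in such an inventory might record what the dataset is, which of the three areas it belongs to, which unit is its sovereign and how it is kept up to date:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetInventoryEntry:
    """Hypothetical record for one dataset in a municipal data inventory."""
    identifier: str                 # stable internal ID
    title: str                      # human-readable name
    description: str                # what the dataset contains
    domain: str                     # "personal", "territorial" or "tax/budget"
    source_unit: str                # unit that creates the data (its "sovereign")
    update_frequency: str           # e.g. "monthly", "quarterly", "annual"
    formats: list[str] = field(default_factory=list)   # e.g. ["CSV", "GeoJSON"]
    published_as_open_data: bool = False

# Example entries for two of the three starting areas mentioned above
inventory = [
    DatasetInventoryEntry(
        identifier="municipal-register",
        title="Municipal register of inhabitants",
        description="People living in the city: ages, gender, postal addresses",
        domain="personal",
        source_unit="Census Office",       # invented unit name
        update_frequency="monthly",
        formats=["CSV"],
        published_as_open_data=False,      # personal data only released aggregated
    ),
    DatasetInventoryEntry(
        identifier="street-map",
        title="Street map, public roads and lighting",
        description="Streets, roads, sewerage, public lighting, boundaries",
        domain="territorial",
        source_unit="Urban Planning",      # invented unit name
        update_frequency="quarterly",
        formats=["GeoJSON"],
        published_as_open_data=True,
    ),
]
```

An inventory of this kind answers, for each dataset, the "what, where and in what state" questions mentioned above, and is the sort of artefact the Zaragoza contract for an inventory of data and information sources aims to produce.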
María Jesús Fernández Ruiz: Another issue that is also fundamental, and which is included in the ordinance, is defining the master datasets. A small anecdote: when creating a spatial data space, the street map, the base cartography and the register of building entrances (portales) are basic. When we got together to work, a technical commission was set up and we agreed to treat these as master datasets for Zaragoza City Council. Data quality is tied to a concept in the ordinance, respecting the sovereignty of the data: whoever creates the data is its sovereign and is responsible for its quality. That sovereignty must be respected, and that is what determines quality.
We then discovered that, within Zaragoza City Council, we had five different identifiers for building entrances (portales). To improve this situation, we defined a single descriptive identifier and declared it as master data. In this way, all municipal units will use the same identifier, the same street map, the same cartography, etc., and this will make all services related to the city interoperable.
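As a purely hypothetical sketch (the identifiers, system names and mapping below are invented for illustration and are not Zaragoza's actual data), consolidating several legacy identifiers into a single declared master identifier might look like this: each municipal system keeps a mapping from its legacy code to the master code, so every service resolves to the same record:

```python
# Hypothetical example: several municipal systems each used their own code for
# the same building entrance; a master identifier is declared and each legacy
# code is mapped to it so all services interoperate.

MASTER_ID = "ES-ZGZ-PORTAL-000123"  # invented descriptive master identifier

legacy_to_master = {
    "CENSUS-7741": MASTER_ID,    # census system
    "TAX-0098-A": MASTER_ID,     # tax system
    "GIS-P-123": MASTER_ID,      # cartography system
    "LIC-2020-55": MASTER_ID,    # licensing system
    "WASTE-4410": MASTER_ID,     # waste collection system
}

def resolve(legacy_code: str) -> str:
    """Return the master identifier for a legacy code, or fail loudly."""
    try:
        return legacy_to_master[legacy_code]
    except KeyError:
        raise ValueError(f"Unmapped legacy identifier: {legacy_code}")

# Two systems referring to the same entrance resolve to one master record
assert resolve("GIS-P-123") == resolve("TAX-0098-A")
```

The design idea is that the master identifier, like the other master datasets, has a single sovereign responsible for its quality, while the remaining systems only reference it.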
7. What additional improvements do you think could be included in future revisions of the ordinance?
Roberto Magro Pedroviejo: The ordinance, being a regulatory instrument, is adapted to current Spanish and European legislation. In other words, we will have to remain very vigilant - we already are - about everything being published on artificial intelligence, data spaces and open data. The ordinance will have to evolve, because it is a regulatory framework intended to comply with current legislation, and if that legislation changes, we will make the appropriate modifications to keep it compliant.
I would also like to highlight two things. Other town councils and a university, specifically the Town Council of San Feliu de Llobregat and the University of La Laguna, have shown interest in the ordinance, and we have received further requests to learn more about it, but the bravest has been Zaragoza City Council, which proposed it and is now going through the process of publication and final approval. We will all surely learn from the experience Zaragoza City Council is gaining about how to tackle this in each administration, because we copy from each other and that lets us move faster. I believe that, little by little, once Zaragoza publishes the ordinance, other city councils and other institutions will join in. Firstly, because it helps to put our own house in order. Now that we are in a process of digital transformation, which is not a fast one but rather a long process, this type of ordinance will help us, above all, to organise the data we hold in the administration. Data, and the governance of that data, will help us to improve public management within the organisation itself, but above all the services provided to citizens.
And the last thing I wanted to emphasise, which is also very important, is that if the data is not of high quality, not kept up to date and not properly described with metadata, we will achieve little or nothing in the administration from the point of view of artificial intelligence, because AI will be built on the data we have, and if that data is not correct or up to date, the results and predictions that AI can produce will be of no use to us in public administration.
María Jesús Fernández Ruiz: What Roberto has just said about artificial intelligence and quality data is very important. I would add two things we are learning while implementing this ordinance. Firstly, the need to define processes: efficient data management has to be based on processes. Secondly, something I think we should discuss, and will discuss within the FEMP, is the importance of defining the roles of the different professionals involved in data management: data manager, data provider, technology provider, etc. If I were drafting the ordinance now, I would include that definition of the roles that must take part in efficient data management. In short: processes and professionals.
Interview clips
Clip 1. What is data governance?
Clip 2. What is the FEMP Model Ordinance on Data Governance?