Interview

In this podcast we talk about transport and mobility data, a topic that is very present in our day-to-day lives. Every time we consult an application to find out how long a bus will take, we are taking advantage of open data linked to transport. In the same way, when an administration carries out urban planning or optimises traffic flows, it makes use of mobility data.

To delve into the challenges and opportunities behind the opening of this type of data by Spanish public administrations, we have two exceptional guests:

  • Tania Gullón Muñoz-Repiso, director of the Division of Transport Studies and Technology of the Ministry of Transport and Sustainable Mobility. Welcome, Tania!
  • Alicia González Jiménez, deputy director in the General Subdirectorate of Cartography and Observation of the Territory of the National Geographic Institute.

Listen to the full episode here (in Spanish)

Summary of the interview

  1. Both the IGN and the Ministry generate a large amount of data related to transport. Of all of them, can you tell us which data and services are made available to the public as open data?

Alicia González: On the part of the National Geographic Institute, I would say everything: everything we produce is available to users, because since the end of 2015 the dissemination policy adopted by the General Directorate of the National Geographic Institute, through the autonomous body National Centre for Geographic Information (CNIG), which is where all products and services are distributed, has been an open data policy. Everything is distributed under the CC BY 4.0 licence, which guarantees free and open use; you simply have to make an attribution, a mention of the origin of the data. So we are talking, in general, not only about transport but about all kinds of data: more than 100 products representing more than two and a half million files that users are increasingly demanding. In fact, in 2024 we had up to 20 million files downloaded, so it is in high demand. Specifically in terms of transport networks, the fundamental dataset is the Geographic Reference Information of Transport Networks (IGR-RT). It is a multimodal geospatial dataset composed of five transport networks that are continuous throughout the national territory and also interconnected. Specifically, it includes:

1. The road network, which is made up of all roads regardless of their owner and runs throughout the territory. There are more than 300 thousand kilometres of road, which are also connected to all the street maps, to the urban road network of all population centres. In other words, we have a road graph that forms the backbone of the entire territory, in addition to connecting the roads that are later distributed and disseminated in the National Topographic Map.

2. The second most important network is the rail transport network. It includes all rail transport data, as well as metro, tram and other rail-based modes.

3 and 4. In the maritime and air domains, the networks are limited to infrastructure: they contain all the ports on the Spanish coast and, on the air side, all aerodromes, airports and heliports.

5. And finally, the last network, which is much more modest and residual, covers cable transport.

Everything is interconnected through intermodal relationships. It is a set of data that is generated from official sources. We cannot incorporate just any data, it must always be official data and it is generated within the framework of cooperation of the National Cartographic System.

As a dataset that complies with the INSPIRE Directive, both in its definition and in the way it is disseminated through standard web services, it has also been classified as a high-value dataset in the mobility category, in accordance with the Implementing Regulation on high-value datasets. It is a fairly important and standardised dataset.

How can it be located and accessed? Precisely because it is standardised, it is catalogued in the IDE (Spatial Data Infrastructure) catalogue, thanks to the standard description of its metadata. It can also be located through the official INSPIRE (Infrastructure for Spatial Information in Europe) data and services catalogue, and it is accessible through portals as relevant as the open data portal.

Once we have located it, how can the user access it? How can they see the data? There are several ways. The easiest: consult our viewer. All the data is displayed there and there are certain query tools to facilitate its use. Then, of course, there is the CNIG download centre, where we publish all the data from all the networks, and it is in great demand. And the last way is to consult the standard web services that we generate: visualization and download services based on different technologies. In other words, it is a dataset that is available to users for reuse.
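
As an illustration, consulting one of these standard visualization (WMS) services from code can be as simple as the following Python sketch. The endpoint URL is a placeholder, not an actual CNIG address; the GetCapabilities operation used here is part of the standard OGC WMS interface.

```python
import requests

# Hypothetical WMS endpoint; replace with the service URL published by the CNIG.
WMS_ENDPOINT = "https://example.org/wms/transport-networks"

# GetCapabilities is a standard OGC WMS operation that lists layers, formats
# and supported coordinate reference systems.
params = {
    "SERVICE": "WMS",
    "REQUEST": "GetCapabilities",
    "VERSION": "1.3.0",
}

response = requests.get(WMS_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

# The answer is an XML capabilities document describing the available layers.
print(response.text[:500])
```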

Tania Gullón: In the Ministry we also share a lot of open data. I would like, in order not to take too long, to comment in particular on four large sets of data:

1. The first would be the OTLE, the Observatory of Transport and Logistics in Spain, which is an initiative of the Ministry of Transport whose main objective is to provide a global and comprehensive vision of the situation of transport and logistics in Spain. It is organized into seven blocks: mobility, socio-economy, infrastructure, safety, sustainability, metropolitan transport and logistics. These are not georeferenced data, but statistical data. The Observatory makes data, graphs, maps and indicators available to the public and, not only that, it also offers annual reports, monographs, conferences, etc. It also covers the cross-border observatories, which are run collaboratively with Portugal and France.

2. The second set of data I want to mention is the NAP, the National Multimodal Transport Access Point, which is an official digital platform managed by the Ministry of Transport but developed collaboratively between the different administrations. Its objective is to centralise and publish all the digitised information on the passenger transport offer in the national territory, for all modes of transport. What do we have here? All schedules, services, routes and stops of all transport services: road transport, urban, intercity, rural and on-demand buses. There are 116 datasets. Also rail transport, with the schedules of all those trains, their stops, etc., as well as maritime transport and air transport. This data is constantly updated. To date, we only have static data in GTFS (General Transit Feed Specification) format, which can be reused and is a standard format useful for the further development of mobility applications by reusers (a minimal reading sketch follows after this list). And while the NAP initially focused on static data, such as routes, schedules and stops, progress is being made toward incorporating dynamic data as well. In fact, in December we have an obligation under European regulations that obliges us to have this data in real time in order, in the end, to improve transport planning and the user experience.

3. The third dataset is Hermes. It is the geographic information system of the general interest transport network. What is its objective? To offer a comprehensive vision, in this case georeferenced. Here I want to refer to what my colleague Alicia has said, so that you can see how we all collaborate with each other. We are not inventing anything: everything is projected onto those road axes of the IGR-RT, the geographic reference information on transport networks, and what is done is to add all the technical parameters as added value, to obtain a complete, comprehensive, multimodal information system for roads, railways, ports, airports, railway terminals and also waterways. It is a GIS (Geographic Information System), which allows all this analysis, not only downloading and consulting through the open web services that we put at the service of citizens, but also through an open data catalogue built with CKAN, which I will comment on later. In the end there are more than 300 parameters that can be consulted. What are we talking about? For each section of road: the average traffic intensity, the average speed, the capacity of the infrastructure, planned actions -not only the network in service, but also the planned network, the actions that the Ministry plans to carry out-, the ownership of the road, lengths, speeds, accidents... many parameters: modes of access, co-financed projects, alternative fuels, the trans-European transport network, etc. That is the third of the datasets.

4. The fourth set is perhaps the largest, because it amounts to 16 GB per day. This is the project we call Big Data Mobility. It is a pioneering initiative that uses Big Data and artificial intelligence technologies to analyse mobility patterns in the country in depth. It is mainly based on the analysis of anonymised mobile phone records of the population to obtain detailed information on people's movements, not individualised but aggregated at the census district level. Since 2020, a daily mobility study has been carried out and all this data is published openly: mobility by hour and by origin/destination, which allows us to monitor and evaluate transport demand in order to plan improvements in infrastructure and services. In addition, as the data is published openly, it can be used for any purpose: for tourism, for research...
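
To make the reuse path concrete, here is a minimal Python sketch (an illustration, not an official NAP client) that reads the mandatory stops.txt and routes.txt files from a GTFS feed like those published on the NAP; the zip file name is a placeholder.

```python
import io
import zipfile

import pandas as pd

# Hypothetical path to a GTFS feed downloaded from the NAP.
GTFS_ZIP = "feed_example.zip"

with zipfile.ZipFile(GTFS_ZIP) as z:
    # stops.txt and routes.txt are mandatory files in the GTFS specification.
    stops = pd.read_csv(io.BytesIO(z.read("stops.txt")))
    routes = pd.read_csv(io.BytesIO(z.read("routes.txt")))

print(f"{len(stops)} stops and {len(routes)} routes in the feed")
# Stop coordinates (stop_lat, stop_lon) can feed a journey planner or a map.
print(stops[["stop_id", "stop_name", "stop_lat", "stop_lon"]].head())
```

For the Big Data Mobility study, a second sketch shows how a reuser might aggregate one of the daily origin-destination files; the file name and column names (origin, destination, hour, trips) are assumptions for illustration, since the real open files define their own schema.

```python
import pandas as pd

# Hypothetical daily origin-destination file; column names are illustrative only.
od = pd.read_csv("od_daily_example.csv")  # columns: origin, destination, hour, trips

# Total trips leaving each origin district over the whole day.
trips_by_origin = (
    od.groupby("origin")["trips"]
    .sum()
    .sort_values(ascending=False)
)

print(trips_by_origin.head(10))  # the ten districts generating the most trips
```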

  2. How is this data generated and collected? What challenges do you have to face in this process and how do you solve them?

Alicia González: Specifically, in the field of products that are generated technologically in geographic information system environments and geospatial databases, in the end these are projects in which the fundamental basis is the capture of data and the integration of existing reference sources. When we see that the owner of an infrastructure has a piece of data, that is the one that must be integrated. In summary, the main technical works could be identified as follows:

  • On the one hand, capture, that is, when we want to store a geographical object we have to digitize it, draw it. Where? On an appropriate metric basis such as the aerial orthophotographs of the National Plan of Aerial Orthophotography (PNOA), which is also another dataset that is available and open. Well, when we have, for example, to draw or digitize a road, we trace it on that aerial image that PNOA provides us.
  • Once we have captured that geometric component, we have to give it attributes, and not just any data will do: they have to come from official sources. So we have to locate who is the owner of that infrastructure, or who is the provider of the official data, to determine the attributes, the characterisation that we want to give to that information, which in principle was only geometric. To do this, we have to carry out a series of source validation processes, to check that it has no issues, and processes that we call integration, which are quite complex, to guarantee that the result meets what we want.
  • And finally, a fundamental phase in all these projects is the assurance of geometric and semantic quality. In other words, a series of quality controls must be developed and executed to validate the product, the final result of that integration and confirm that it meets the requirements indicated in the product specification.

In terms of challenges, a fundamental one is data governance: the result that is generated is fed from certain sources, but in the end a new result is created, so you have to define the role of each provider, who may later also be a user. Another challenge in this whole process is locating data providers. Sometimes the body responsible for the infrastructure or the object that we want to store in the database does not publish the information in a standardised way, or it is difficult to locate because it is not in a catalogue. Sometimes it is difficult to locate the official source you need to complete the geographic information. And looking a little at the user, I would highlight that another challenge is having the agility to identify, in a flexible and fast way, the use cases that are changing along with what users demand from us, because in the end it is about continuing to be relevant to society. Finally, and because the Geographic Institute is a scientific and technical environment and this affects us a lot, another challenge is digital transformation: we are working on technological projects, so we also have to have a lot of capacity to manage change and adapt to new technologies.

Tania Gullón: Regarding how data is generated and collected and the challenges we face: the NAP, the National Access Point for Multimodal Transport, for example, is generated collaboratively, that is, the data comes from the autonomous communities themselves, from the consortia and from the transport companies. The challenge is that there are many autonomous communities that are not yet digitised, there are many companies... The digitalisation of the sector is going slowly - it is moving, but slowly. In the end there is incomplete data and duplicate data. Governance is not yet well defined. It happens to us that, imagine, the company ALSA uploads all its buses, but it has buses in all the autonomous communities. And if at the same time an autonomous community uploads its data, that data is duplicated. It's as simple as that. It is true that we are just starting and that governance is not yet well defined so as to avoid excess data. Before, data was missing; now there is almost too much.

In Hermes, the geographic information system, what is done, as I said, is to project everything onto the information on transport networks, which is the official one that Alicia mentioned, and to integrate data from the different infrastructure managers and administrators, such as Adif, Puertos del Estado, AENA, the General Directorate of Roads, ENAIRE, etc. What is the main challenge, if I had to pick one? Because we could talk about this for an hour. It has cost us a lot. We have been working on this project for seven years and it has cost a lot because, first, people did not believe in it. They didn't think it was going to work and they didn't collaborate. In the end, all this means knocking on the door of Adif, of AENA, and changing that mindset: data cannot sit in a drawer, it must all be put at the service of the common good. And I think that's what has cost us a little more. In addition, there is the issue of governance, which Alicia has already mentioned. You go to ask for a piece of data and in the organisation itself they do not know who owns that data, because perhaps traffic data is handled by different departments. And who owns it? All this is very important.

We have to say that Hermes has been the great driver of data offices, of the Adif data office. In the end they realised that what they needed was to put their house in order, as in everyone's house, and in the Ministry as well: data offices are needed.

In the Big Data project, how is the data generated? In this case it is completely different. It is a pioneering project, based more on new technologies, in which data is generated from anonymised mobile phone records. By reconstructing that large volume of Big Data, the records registered by each antenna in Spain, with artificial intelligence and a series of algorithms, these matrices are rebuilt. Then, the data from that sample - in the end we have a sample of 30% of the population, more than 13 million mobile lines - is extrapolated with open data from the INE. And then, what else do we do? It is calibrated with external sources, that is, with sources of known reference, such as AENA ticketing, flights, Renfe data, etc. We calibrate the model in order to generate these matrices with quality. The challenges: it is very experimental. To give you an idea, we are the only country that has all this data, so we have been opening a path and learning along the way. The difficulty is, again, the data: the data to calibrate with is difficult for us to find and to be given with a certain periodicity, because this runs in real time and we permanently need that flow of data. There is also adaptation to the user, as Alicia has said: we must adapt to what society and the reusers of this Big Data are demanding. And we must also keep pace, as Alicia said, with technology: the telephony data that exists now is not the same as it was two years ago. And the great challenge of quality control. But here I think I'm going to let Alicia, who is the super expert, explain to us what mechanisms exist to ensure that the data are reliable, updated and comparable. And then I will give you my vision, if you like.
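
A toy illustration of the extrapolation step just described: trips observed in the mobile phone sample are scaled to the whole population with an expansion factor derived from INE population figures. All numbers and names here are illustrative; they are not the actual calibration procedure, which is considerably more elaborate.

```python
# Illustrative expansion of sample trips to the whole population.
population = 48_000_000     # resident population taken from INE figures (illustrative)
sample_lines = 13_000_000   # mobile lines in the anonymised sample (~30% of population)

expansion_factor = population / sample_lines

sample_trips = 1_250        # trips observed in the sample for one OD pair and hour
estimated_trips = sample_trips * expansion_factor

print(f"Expansion factor: {expansion_factor:.2f}")
print(f"Estimated total trips: {estimated_trips:.0f}")
```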

Alicia González: How can reliability, updating and comparability be guaranteed? I don't know if reliability can be guaranteed, but I think there are a couple of indicators that are especially relevant. One is the degree to which a dataset conforms to the regulations that concern it. In the field of geographic information, the way of working is always standardised: there is the ISO 19100 family on Geographic Information/Geomatics, or the INSPIRE Directive itself, which greatly conditions the way of working and publishing data. And also, looking at the public administration, I think that the official seal should also be a guarantee of reliability. In other words, when we process data we must do so in a homogeneous and unbiased way, whereas a private company may perhaps be conditioned by such biases. I believe that these two parameters are important and can indicate reliability.

In terms of the degree of updating and comparability of the data, I believe that the user deduces this information from the metadata. Metadata are, in the end, the cover letter of a dataset. So, if a dataset's metadata are complete and accurate, and if they are also created according to standard profiles - in the geospatial field we are talking about the INSPIRE or GeoDCAT-AP profiles - then it is much easier to see whether different datasets are comparable, and the user can decide whether a dataset satisfies their needs in terms of updating and comparability with another dataset.

Tania Gullón: Totally agree, Alicia. And if you allow me to add: in Big Data, for example, we have always been very committed to measuring quality - all the more so with new technologies whose results, at first, people did not trust. Always trying to measure this quality - which, in this case, is very difficult because they are large datasets - from the beginning we started designing processes that take time. The daily quality control process for the data takes seven hours, but it is true that at the beginning we had to detect whether an antenna had gone down, whether something had happened... Then we run a control with statistical parameters and other internal consistency checks, and what we detect here are anomalies. What we are seeing is that 90% of the anomalies that come out are real mobility anomalies. In other words, there are no errors in the data; they are anomalies: there has been a demonstration or a football match. These are events that distort mobility. Or there has been a storm, heavy rain, or anything like that. And it is important not only to control that quality and see if there are anomalies; we also believe it is very important to publish the quality criteria: how we are measuring quality and, above all, the results. Not only do we publish the data on a daily basis, we also publish this quality metadata that Alicia mentions: what the sample was like that day, the anomaly values that were obtained. This is also published openly: not only the data, but the metadata. And then we also publish the anomalies and the reason for them. When anomalies are found we say "okay, there has been an anomaly because in the town of Casar - imagine, this covers all of Spain - it was the festival of the torta del Casar". And that's it: the anomaly has been found and it is published.
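
As a simplified illustration of the internal-consistency checks described above, the sketch below flags days whose total trips deviate strongly from the rest of the series using a z-score. The figures are invented and the real quality pipeline is, of course, far more elaborate.

```python
import pandas as pd

# Hypothetical series of total daily trips (index = date, values = trips).
daily_trips = pd.Series(
    [5.1e6, 5.3e6, 5.2e6, 5.0e6, 8.9e6, 5.2e6, 5.1e6],
    index=pd.date_range("2024-05-01", periods=7),
)

# Z-score of each day against the mean and standard deviation of the series.
z_scores = (daily_trips - daily_trips.mean()) / daily_trips.std()

# Days beyond 2 standard deviations are flagged for review: they may be data
# errors, but most often they reflect real events (festivals, storms, matches).
anomalies = daily_trips[z_scores.abs() > 2]
print(anomalies)
```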

And how do you measure another quality parameter, thematic accuracy? In this case, by comparing with trusted reference sources. We know that the evolution of the data with respect to itself is already very controlled with that internal logical consistency, but we also have to compare it with what happens in the real world. I talked about it before with Alicia; we said "the data is reliable, but what is the reality of mobility? Who knows it?" In the end we have some clues, such as ticketing data on how many people have boarded the buses. If we have that data, we have a clue, but for the people who walk and the people who take their cars and so on, what is the reality? It is very difficult to have a point of comparison, but we do compare it with all the data from AENA, Renfe, bus concessions, and all these controls are run to determine how far we deviate from the reality that we can know.

  3. All this data serves as a basis for developing applications and solutions, but it is also essential when it comes to making decisions and accelerating the implementation of the central axes, for example, the Safe, Sustainable and Connected Mobility Strategy or the Sustainable Mobility Bill. How is this data used to make these real decisions?

Tania Gullón: If you will allow me, I would first like to introduce this strategy and the Law for those who do not know them. One of the axes, axis 5, of the Ministry's Safe, Sustainable and Connected Mobility Strategy 2030 is "Smart Mobility". It is focused precisely on this, and its main objective is to promote digitalisation, innovation and the use of advanced technologies to improve efficiency, sustainability and user experience in Spain's transport system. And precisely one of the measures of this axis is the "facilitation of Mobility as a Service, Open Data and New Technologies". In other words, this is where all the projects we are discussing are framed. In fact, one sub-measure is to promote the publication of open mobility data, another is to carry out analyses of mobility flows, and the last is the creation of an integrated mobility data space. I would like to emphasise - and here I am already moving on to the Bill, which we hope will soon be approved - that the Law, in Article 89, regulates the National Access Point, so we see how it is included in this legislative instrument. The Law also establishes a key digital instrument for the National Sustainable Mobility System: note the importance given to data, that a mobility law states that this integrated mobility data space is a key digital instrument. This data space is a reliable data-sharing ecosystem, materialised as a digital infrastructure managed by the Ministry of Transport in coordination with SEDIA (the Secretary of State for Digitalisation and Artificial Intelligence), whose objective is to centralise and structure the information on mobility generated by public administrations, transport operators, infrastructure managers, etc. and to guarantee open access to all this data for all administrations under the conditions set by the regulations.

Alicia González: In this case, I want to say that any objective decision-making, of course, has to be based on data that, as we said before, has to be reliable, up-to-date and comparable. In this sense, it should be noted that the fundamental support the IGN offers to the Ministry for the deployment of the Safe, Sustainable and Connected Mobility Strategy is the provision of data services and complex analyses of geospatial information, many of them, of course, based on the transport networks dataset we have been talking about.

In this sense, we would like to mention as an example the accessibility maps with which we contribute to axis 1 of the strategy, "Mobility for all". Through the Rural Mobility Table, the IGN was asked whether we could generate maps representing the cost, in time and distance, for any citizen living in any population centre to access the nearest transport infrastructure, starting with the road network. In other words, how much it costs a user, in terms of effort, time and distance, to access the nearest motorway or dual carriageway from their home and then, by extension, any road in the basic network. We did that analysis - as I said, this network is the backbone of the entire territory and it is continuous - and we finally published the results via the web. They are also open data: any user can consult them and, in addition, we offer them not only numerically but also represented in different types of maps. In the end, this geolocated visibility of the result provides fundamental value and facilitates, of course, strategic decision-making in terms of infrastructure planning.

Another example to highlight that is possible thanks to the availability of open data is the calculation of monitoring indicators of the Sustainable Development Goals of the 2030 Agenda. Currently, in collaboration with the National Institute of Statistics, we are working on the calculation of several of them, including one directly associated with Transport, which seeks to monitor goal 11, which is to make cities more inclusive, safe, resilient and sustainable.

  4. Speaking of this data-based decision-making, there is also cooperation at the level of data generation and reuse between different public administrations. Can you tell us about any example of such a project?

Tania Gullón: Let me also come back to data-based decision-making, which I touched on indirectly before with the issue of the Law. It can also be said that all this Big Data, Hermes and everything we have discussed is favouring this shift of the Ministry and other organisations towards data-driven organisations, which means that decisions are based on the analysis of objective data. When you ask for an example like that, I have so many that I wouldn't know what to tell you. In the case of the Big Data data, it has been used for infrastructure planning for a few years now. Before, it was done with surveys, and infrastructure was sized on that basis: how many lanes do I put on a road? Or something very basic: what frequency do we need on a train line? If you don't have data on what the demand is going to be, you can't plan it. This is now done with the Big Data data, not only by the Ministry but, as it is open, by all administrations, all city councils and all infrastructure managers. Knowing the mobility needs of the population allows us to adapt our infrastructure and services to these real needs. For example, commuter services in Galicia are now being studied. Or imagine the undergrounding of the A-5. The data is also used for emergencies, which we have not mentioned, but it is key there too. We always realise that when there is an emergency, suddenly everyone thinks "data, where is the data, where is the open data?", because it has been fundamental. I can tell you that in the case of the DANA, which is perhaps the most recent, several commuter train lines were seriously affected, the tracks were destroyed, and 99% of the vehicles of the people who lived in Paiporta, in Torrent, in the entire affected area were disabled. And the remaining 1% only because they were not in the DANA area at the time. So mobility had to be restored as soon as possible, and thanks to this open data, within a week there were buses providing alternative transport services that had been planned with the Big Data data. In other words, look at the impact on the population.

Speaking of emergencies, this project was born precisely because of an emergency: COVID. In other words, this Big Data study was born in 2020 because the Presidency of the Government took charge of monitoring this mobility on a daily basis and publishing it openly. And here I link with that collaboration between administrations, organisations, companies and universities. Because look, these mobility data fed the epidemiological models. Here we worked with the Carlos III Institute, with the Barcelona Supercomputing Center, with the institutes and research centres that were beginning to size hospital beds for the second wave. When we were still in the first wave, we didn't even know what a wave was, and they were already telling us "be careful, because there is going to be a second wave, and with this mobility data we will be able to estimate how many beds are going to be needed, according to the epidemiological model". Look how important that reuse is. We know that this Big Data, for example, is being used by thousands of companies, administrations, research centres and researchers around the world. In addition, we receive enquiries from Germany, from all countries, because in Spain we are somewhat of a pioneer in publishing all this data openly. We are setting an example there, and not only for transport, but for tourism as well, for example.

Alicia González: In the field of geographic information, at the level of cooperation, we have a specific instrument, the National Cartographic System, which directly promotes coordination of the actions of the different administrations in terms of geographic information. We do not know how to work in any other way than by cooperating. And a clear example is the very dataset we have been talking about: the geographic reference information on transport networks is the result of this cooperation. That is to say, at the national level it is driven and promoted by the Geographic Institute, but regional cartographic agencies also participate in its production and updating, with different degrees of collaboration. This even reaches the point of co-production of data for certain subsets in certain areas. In addition, one of the characteristics of this product is that it is generated from official data from other sources. In other words, there is collaboration there no matter what: there is cooperation because there is an integration of data, because in the end it has to be filled in with the official data. To begin with, that may be data provided by the INE, the Cadastre, the cartographic agencies themselves, the local street maps... But once the result has been formed, as I mentioned before, it has an added value that is of interest to the original supplier itself. For example, this dataset is reused internally, at home, in the IGN: any product or service that requires transport information is fed by this dataset. There is internal reuse there, but also across public administrations, at all levels. In the state sector, for example, in the Cadastre, once the result has been generated, it is of interest to them for studies analysing the delimitation of the public domain associated with infrastructure, for example. Or the Ministry itself, as Tania mentioned before: Hermes was generated from processing the RT data, the transport network data. The Directorate-General for Roads uses the transport networks in its internal management to make a traffic map, to manage its catalogue, etc. And in the autonomous communities themselves, the result generated is also useful to the cartographic agencies, or even at the local level. So there is continuous, cyclical reuse, as it should be: in the end it is all public money and it has to be reused as much as possible. And in the private sphere, it is also reused, and value-added services are generated from this data and provided in multiple use cases. Not to go on too long, simply that: we participate by providing data on which value-added services are generated.

  5. And finally, could you briefly recap some ideas that highlight the impact on daily life and the commercial potential of this data for reusers?

Alicia González: Very briefly, I think that the fundamental impact on everyday life is that the distribution of open data has made it possible to democratise access to data for everyone: for companies, but also for citizens. And, above all, I think it has been fundamental in the academic field, where it is surely now easier to develop certain research that in other times was more complex. Another impact on daily life is the institutional transparency that this implies. As for the commercial potential for reusers, I reiterate the previous idea: the availability of data drives innovation and the growth of value-added solutions. In this sense, looking at one of the conclusions of the report carried out in 2024 by ASEDIE, the Association of Infomedia Companies, on the impact that the geospatial data published by the CNIG had on the private sector, there were a couple of quite important conclusions. One of them said that every time a new dataset is released, reusers are incentivised to generate value-added solutions and, in addition, it allows them to focus their efforts on developing innovation and not so much on data capture. It was also clear from that report that, since the adoption of the open data policy that I mentioned at the beginning, adopted in 2015 by the IGN, 75% of the companies surveyed responded that they had been able to significantly expand their catalogue of products and services based on this open data. So I believe that the impact is ultimately enriching for society as a whole.

Tania Gullón: I subscribe to all of Alicia's words; I totally agree. And also, small transport operators and municipalities with fewer resources have at their disposal all this open, free, quality data and access to digital tools that allow them to compete on equal terms. In the case of companies or municipalities, imagine being able to plan their transport and be more efficient: not only does it save them money, they also end up improving the service to the citizen. And of course, the fact that in the public sector decisions are made based on data, and that this data-sharing ecosystem is encouraged, favouring the development of mobility applications, for example, has a direct impact on people's daily lives. Or take the issue of transport aid: studying the impact of transport subsidies with accessibility data and so on. You study who the most vulnerable are and, in the end, what do you achieve? Policies become increasingly fair, and this obviously has an impact on the citizen. Decisions about how to invest everyone's money, our taxes, in infrastructure, aid or services should be based on objective data, not on intuitions, but on real data. This is the most important thing.

Interview

In this episode we talk about the environment, focusing on the role that data plays in the ecological transition. Can open data help drive sustainability and protect the planet? We found out with our two guests:

  • Francisco José Martínez García, conservation director of the natural parks of the south of Alicante.
  • José Norberto Mazón, professor of computer languages and systems at the University of Alicante.

Listen to the full episode here (in Spanish)

Summary of the interview

  1. You are both passionate about the use of data for society. How did you discover the potential of open data for environmental management?

Francisco José Martínez: For my part, I can tell you that when I arrived at the public administration, at the Generalitat Valenciana, the Generalitat launched a viewer called Visor Gva, which is open and provides a lot of information: images, metadata, data in various fields... And the truth is that it made it much easier for me - and continues to do so - to work on processing administrative files, the day-to-day work of a civil servant. Later, another database was also incorporated, the Biodiversity Data Bank, which offers data in one-kilometre by one-kilometre grids. And finally, applied to the natural areas and wetlands that I manage, water quality data. All of this is open and can be used by any researcher to generate applied research.

Jose Norberto Mazón: In my case, it was precisely with Francisco as director. He directs three natural parks that are wetlands in the south of Alicante, and regarding one of them in which we had a special interest, the Natural Park of Laguna de la Mata and Torrevieja, Francisco told us about his experience - all this experience he has just described. We at the University of Alicante have been working for some time on data management, open data, data interoperability, etc., and we saw the opportunity to take a perspective of data management, data generation and data reuse from the territory, from the Natural Park itself. Together with other entities such as Proyecto Mastral, Faunatura and AGAMED, and also colleagues from the Polytechnic University of Valencia, we saw the possibility of studying these useful data, focusing above all on the concept of high-value data, which the European Union was betting on: data that have the potential to generate socio-economic or environmental benefits, benefit all users and contribute to building a European society based on the data economy. And so we set out to see how we could collaborate, especially to discover the potential of data at the territorial level.

  2. Through a strategy called the Green Deal, the European Union aims to become the world's first competitive and resource-efficient economy, achieving net-zero greenhouse gas emissions by 2050. What concrete measures are most urgent to achieve this and how can data help to achieve these goals?

Francisco José Martínez: The European Union has several lines, several projects, such as the LIFE project, focused on endangered species, or the ERDF funds to restore habitats... Here in Laguna de la Mata and Torrevieja, we have improved terrestrial habitats with these ERDF funds, and the aim is precisely for these habitats to be better CO2 sinks and to generate more native plant communities, eliminating invasive species. Then we also have the regulation on nature restoration, in force since 2024, which requires us to restore up to 30% of degraded terrestrial and marine ecosystems. I must also say that the Biodiversity Foundation, under the Ministry, generates quite a few projects related, for example, to the creation of climate shelters in urban areas. In other words, there are a series of projects and a lot of funding in everything that has to do with renaturalisation, habitat improvement and species conservation.

Jose Norberto Mazón: To complement what Francisco has said, I would also focus on data management, on the importance given to data management in the European Green Deal, specifically with data-sharing projects to make data more interoperable. In other words, all the actors that generate data can make it useful through its combination and generate much more value in what are called data spaces, and especially in the data space of the European Green Deal. Recently, in addition, some initial projects have just finished. To highlight a couple of them: the USAGE project (Urban Data Spaces for Green dEal), of which I will mention two specific pilots they developed that are very interesting. One on how everything that has to do with data for mitigating climate change has to be introduced into urban management, in the city of Ferrara, in Italy. And another pilot on data governance and how it has to be done to comply with the FAIR principles, in this case in Zaragoza, with a concept of climate islands that is also very interesting. Then there is another project, AD4GD (All Data for Green Deal), which has also carried out pilots on this interoperability of data, in this case in the Berlin lake network: Berlin has about 300 lakes whose water quality, water quantity, etc. have to be monitored, and this has been done through sensorisation. Also the management of biological corridors in Catalonia, with data on how species move and how these biological corridors need to be managed. And they have also done some air quality initiatives with citizen science. These projects have already been completed, but there is a very interesting project at the European level that is going to launch this large data space of the European Green Deal: the SAGE (Sustainable Green Europe Data Space) project, which is developing ten use cases covering this entire area of the European Green Deal. Specifically, to highlight one that is very pertinent because it is aligned with the natural parks, the wetlands of the south of Alicante that Francisco directs: the one on the trade-offs between nature and ecosystem services. That is, nature must be protected, we have to conserve, but we also have to allow socio-economic activities in a sustainable way. This data space will integrate remote sensing, models based on artificial intelligence, data, etc.

  3. Would you like to add any other projects at this local or regional level?

Francisco José Martínez: Yes, of course. The one we have done with Norberto, his team and several departments of the Polytechnic University of Valencia and the University of Alicante: the digital twin. Research has been carried out to generate a digital twin of the Natural Park of Las Lagunas, here in Torrevieja. And the truth is that it has been applied research; a lot of data has been generated from sensors, and also from direct observations and from image and sound recorders. A good record of information has been compiled on noise, climate and meteorological data to enable good management, and it is an invaluable help for those of us who have to make decisions day by day. Other data collected in this project has been of a social nature: tourist use, people's feelings (whether or not they agree with what they see in the natural space). In other words, we have improved our knowledge of this natural space thanks to this digital twin, and that is information that neither our viewer nor the Biodiversity Data Bank can provide us.

Jose Norberto Mazón: Francisco was talking, for example, about knowledge of people, about the influx of people in certain areas of the natural park, and also about knowing what the people who visit it feel and think, because other than through surveys, which are very cumbersome, that is complicated. We have put this digital twin, with all that sensorisation, at the service of discovering that knowledge, with data that in the end are also interoperable and allow us to know the territory very well. Obviously, the fact that it is territorial does not mean that it is not scalable. What we are doing with the digital twin project, the ChanTwin project, can be transferred or extrapolated to any other natural area, because the problems we have encountered will appear in any natural area, such as connectivity problems, interoperability problems with data coming from sensors, etc. We have sensors of many types: influx of people, water quality, temperatures and climatic variables, pollution, etc., and always with full guarantees of data privacy. I have to say this, because it is very important: we always try to ensure that this data collection, of course, guarantees people's privacy. We can know the concerns of the people who visit the park and also, for example, the origin of those people. And this is very interesting information at the level of park management, because in this way, for example, Francisco can make more informed decisions to manage the park better. But the people who visit the park also come from a specific municipality, with a city council that, for example, has a Department of the Environment or a Department of Tourism, and this information can be very interesting for highlighting certain aspects, for example environmental, biodiversity, or socio-economic activity.

Francisco José Martínez: Data are fundamental in the management of the natural environment of a wetland, a mountain, a forest, a pasture... of all natural spaces in general. Note that simply by tracking and monitoring certain environmental parameters we can explain events that may happen, for example, a fish die-off. Without the history of temperature and dissolved oxygen data, it is very difficult to know whether it is due to that or to a pollutant. For example, the temperature of the water is related to dissolved oxygen: the higher the temperature, the less dissolved oxygen. And without oxygen, in spring and summer - when the ambient temperatures, whatever they are, transfer to the water, to the lagoons, to the wetlands - a disease appears, botulism, and for two years now more than a thousand animals have died each year. The way to control it is by anticipating that temperatures are going to reach a specific threshold, beyond which oxygen almost disappears from the water, which gives us time to plan the work teams that remove the carcasses, which is the fundamental action to prevent it. Another example is the monthly census of waterfowl, which are observed in person and recorded; we also have recorders that capture sounds. With that we can know the dynamics of when species arrive in migration, and with that we can also manage the water. Another example is the temperature of the lagoon here in La Mata, which we are monitoring with the digital twin, because we know that when it reaches almost thirty degrees, the main food of the birds, brine shrimp, disappears, because they cannot live at those extreme temperatures with that salinity. But we can bring in sea water, which, despite the very hot recent springs and summers, is always cooler, so we can refresh the lagoon and extend the life of this species, which is precisely synchronised with the reproduction of the birds. So we can manage the water thanks to the monitoring and thanks to the data we have on water temperatures.
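
As a minimal sketch of the kind of threshold-based alert this monitoring enables, the following checks hypothetical lagoon temperature readings from the digital twin against an approximate 30 °C limit; the variable names, values and threshold are illustrative only, not an actual park system.

```python
# Illustrative threshold alert based on lagoon water temperature readings.
TEMPERATURE_LIMIT_C = 30.0  # approximate level at which brine shrimp cannot survive

readings = [27.4, 28.1, 29.2, 29.8, 30.3]  # hypothetical recent sensor values (deg C)

latest = readings[-1]
rising = readings[-1] > readings[0]

if latest >= TEMPERATURE_LIMIT_C or (rising and latest >= TEMPERATURE_LIMIT_C - 1):
    # In practice this would support the decision to bring in cooler sea water.
    print(f"Alert: lagoon at {latest} °C, plan sea water intake")
else:
    print(f"Lagoon temperature {latest} °C within safe range")
```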

Jose Norberto Mazón: Look at the importance of these examples Francisco mentioned, which are paradigmatic, and at the importance of the use of data. I would simply add that, in the end, the effort is to make all this data open and compliant with the FAIR principles, that is, interoperable, because as we have heard Francisco comment, they are data from many sources, each with different characteristics, collected in different ways, etc. He talks about sensor data, but also other data that is collected in other ways. And then also that they allow us in some way to start co-creation processes for tools that use this data at various levels. Of course, at the level of management of the natural park itself, to make informed decisions, but also at the level of citizens, and even at the level of other types of professionals. As Francisco said, economic activities are carried out in the parks, in these wetlands, and therefore being able to co-create tools with these actors, or with the university's own research staff, is very interesting. And here it is always a matter of encouraging third parties, both natural and legal persons, for example, companies, startups, entrepreneurs, etc., to build various applications and value-added services with that data: to design easy-to-use tools for decision-making, for example, or any other type of tool. This would be very interesting, because it would also give us an entrepreneurial ecosystem around that data. And it would also make society itself more involved, through this open data and its reuse, in environmental care and environmental awareness.

  4. An important aspect of this transition is that it must be "fair and leave no one behind". What role can data play in ensuring that equity?

Francisco José Martínez: In our case, we have been carrying out citizen science actions with the Environmental Education and Dissemination technicians. We are collecting data with the people who sign up for these activities. We do two activities a month and, for example, we have carried out censuses of bats of different species - because one sees bats but cannot distinguish the species, sometimes without even seeing them - on night routes, to detect and record them. We have also done photo-trapping activities to detect mammals that are very difficult to see. With this we get children, families, people in general to learn about fauna they did not know existed when they are walking in the mountains. And I believe that we reach a lot of people and that we are disseminating this to as many people, as many sectors, as possible.

Jose Norberto Mazón: And from that data - in fact, look at the amount of data Francisco is talking about - and promoting that line Francisco follows as director of the Natural Parks of the south of Alicante, what we ask ourselves is: can we go one step further using technology? And we have made video games that make it possible to raise awareness among target groups that may otherwise be very difficult to reach. For example, teenagers, in whom we must somehow instil that behaviour, that appreciation of the importance of natural parks. And we think that video games can be a very interesting channel. And how have we done it? By basing these video games on data: on the data Francisco has mentioned and also on the data of the digital twin itself. That is, data we have on the water surface, noise levels... We include all this data in the video games. They are dynamic video games that allow a better awareness of what the natural park is and of its environmental values and biodiversity conservation.

  5. You've been talking to us for a while about all the data you use, which in the end comes from various sources. Can you summarize the type of data you use in your day-to-day work and the challenges you encounter when integrating it into specific projects?

Francisco José Martínez: The data are spatial: images with their metadata, censuses of birds, mammals and the different taxonomic groups, fauna, flora... We also carry out inventories of protected flora in danger of extinction. Meteorological data are fundamental and, by the way, also very important when it comes to civil protection; look at all the disasters caused by cold drops or cut-off lows. Very important data also include water quality, physical and chemical data, and the height of the water surface, which helps us to know evaporation and evaporation curves and thus manage water inputs. And of course, social data on public use, because public use is very important in natural spaces. It is a way of opening up to citizens, to people, so that they can get to know their natural resources, value them and thus protect them. As for the difficulties, it is true that there is a series of data, especially from research, that we cannot access: it sits in repositories for technicians in the administration, or even for consultants, and is difficult to access. I think Norberto can explain this better: how this could be integrated into platforms, by sectors, by groups...

Jose Norberto Mazón: In fact, it is a core issue for us. In the end there is a lot of open data, as Francisco has explained throughout this time we have been talking, but it is true that it is very dispersed, because it is also generated to meet various objectives. In the end, the main objective of open data is for it to be reused, that is, used for purposes other than those for which it was initially collected. But what we find is that there are many proposals that are, as we would say, top-down. Yet where the problem really lies is in the territory, from the bottom up, with all the actors involved in the territory, because a lot of data is generated in the territory itself. In other words, it is true that there is data, for example satellite data from remote sensing, which is generated by the satellites themselves and then reused by us, but the data that comes from sensors or from citizen science, etc., is generated in the territory itself. And we find that, many times, with that data - for example, if researchers do a study in a specific natural park, then obviously that research team publishes its articles and data openly (because the Science Law requires them to publish openly in repositories) - the publication is very research-oriented. So the other types of actors, for example the park management, the managers of a local entity or even the citizens themselves, are perhaps not aware that this data is available and do not even have mechanisms to consult it and obtain value from it. The greatest difficulty, in fact, is this: ensuring that the data generated in the territory is reused from the territory, where it is easiest to reuse it to solve these problems as well. And that difficulty is what we are trying to tackle with the projects we have underway, at the moment with the creation of a data lake, a data architecture that allows us to manage all that heterogeneity of the data and to do it from the territory. But of course, what we really have to do is try to do it in a federated way, with that philosophy of open data at the federated level, and with something extra as well, because the casuistry within the territory is very varied. There are a multitude of actors; we are talking about open data, but there may also be actors who say "I want to share certain data, but not certain other data yet, because I may lose some competitiveness, although I would not mind sharing it in three months' time". In other words, it is also necessary to have control over certain types of data, and for open data to coexist with other types of data that can be shared, perhaps not so broadly, but in a way that, let's say, provides great value. We are looking at this possibility with a new project that we are creating: an environmental and biodiversity data space for these three natural parks in the south of the province of Alicante, and we are working on that project: Heleade.

If you want to know more about these projects, we invite you to visit their websites.

Interview clips

1. How was the digital twin of the Lagunas de Torrevieja Natural Park conceived? 

 

2. What projects are being promoted within the framework of the European Green Deal Data Space? 

Interview

Do you know why it is so important to categorize datasets? Do you know the references that exist to do it according to the global, European and national standards? In this podcast we tell you the keys to categorizing datasets and guide you to do it in your organization.

  • David Portolés, Project Manager of the Advisory Service
  • Manuel Ángel Jáñez, Senior Data Expert

Listen to the full podcast here (in Spanish)

Summary of the interview

  1. What do we mean when we talk about cataloguing data and why is it so important to do so?

David Portolés: When we talk about cataloguing data, what we want is to describe it in a structured way. In other words, we talk about metadata: information related to data. Why is it so important? Because thanks to this metadata, interoperability is achieved. This word may sound complicated, but it simply means that systems can communicate with each other autonomously.

Manuel Ángel Jáñez: Exactly, as David says, categorizing is not just labeling. It is about providing data with properties that make it understandable, accessible and reusable. For that we need agreements or standards. If each producer defines their own rules, consumers will not be able to interpret them correctly, and value is lost. Categorizing means reaching a consensus between the general and the specific, and this is not new: it is an evolution of library documentation, adapted to the digital environment.

  2. So interoperability means speaking the same language in order to get the most out of the data. What references are there at global, European and national level?

Manuel Ángel Jáñez: Data should be described in an open way, using standards or reference specifications and frameworks:

  • Globally: DCAT (a W3C recommendation) allows you to model catalogs, datasets, distributions, services, etc. In essence, all the entities that are key and that are then reused in the rest of the profiles.

  • At the European level: DCAT-AP, the application profile for data portals in the European Union, particularly those of the public sector. It is essentially the basis for the Spanish profile, DCAT-AP-ES.

  • In Spain: DCAT-AP-ES incorporates more specific restrictions for the Spanish context. It is a profile based on the 2013 Technical Standard for Interoperability (NTI). It adds new features, evolves the model to make it compatible with the European standard, covers high-value datasets (HVDs) and adapts the standard to the current data ecosystem (a minimal example of this kind of description follows below).
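To make the idea of a DCAT description more tangible, here is a minimal, hypothetical sketch in Python using the rdflib library. The dataset, publisher and URLs are invented for illustration, and a real DCAT-AP-ES record would include additional mandatory properties defined in the profile.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

# Build a tiny description of one dataset and one distribution (all URIs are invented).
g = Graph()
dataset = URIRef("https://example.org/catalogo/dataset/calidad-aire")
distribution = URIRef("https://example.org/catalogo/dataset/calidad-aire/csv")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Calidad del aire - estaciones urbanas", lang="es")))
g.add((dataset, DCTERMS.description, Literal("Hourly air-quality measurements (illustrative).", lang="en")))
g.add((dataset, DCTERMS.publisher, URIRef("https://example.org/organizacion/ayuntamiento")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.mediaType, URIRef("https://www.iana.org/assignments/media-types/text/csv")))
g.add((distribution, DCAT.downloadURL, URIRef("https://example.org/datos/calidad-aire.csv")))

# Serialise as Turtle so another catalogue or harvester could read it.
print(g.serialize(format="turtle"))
```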

David Portolés: With a good description, reusers can search for, retrieve and locate the datasets that interest them and, at the same time, discover new datasets they had not considered. Standards, models and shared vocabularies differ mainly in the degree of detail they apply. The key is to strike a balance: general enough not to be restrictive, yet specific enough to be useful. Although we talk a lot about open data, these standards also apply to protected data, which can equally be described. The universe of application of these standards is very broad.

  3. Focusing on DCAT-AP-ES, what help or resources are there for a user to implement it?

David Portolés: DCAT-AP-ES is a set of rules and basic application models. Like any technical standard, it has an application guide and, in addition, there is an online implementation guide with examples, conventions, frequently asked questions and spaces for technical and informative discussion. This guide has a very clear purpose: to create a community around the technical standard and to generate a knowledge base accessible to all, with a transparent and open support channel for anyone who wants to participate.

Manuel Ángel Jáñez: The available resources do not start from scratch. Everything is aligned with European initiatives such as SEMIC, which promotes semantic interoperability in the EU. We want a living and dynamic tool that evolves with needs, under a participatory approach, with good practices, debates, harmonisation of the profile, etc. In short, the aim is for the model to be useful, robust, easy to maintain over time and flexible enough for anyone to participate in its improvement.

  4. Is there any existing thematic implementation in DCAT-AP-ES?

Manuel Ángel Jáñez: Yes, important steps have been taken in that direction. For example, the high-value dataset model has already been included, which is key for data relevant to the economy and society and useful for AI, among other areas. DCAT-AP-ES is inspired by profiles such as DCAT-AP v2.1.1 (2022), which incorporates some semantic improvements, but there are still thematic implementations to be incorporated into DCAT-AP-ES, such as data series. The idea is that thematic extensions will enable modelling for specific types of datasets.

David Portolés: As Manu says, the idea is that it is a living model. Possible future extensions are:

  • Geographical data: GeoDCAT-AP (European).
  • Statistical data: StatDCAT-AP.

In addition, future directives on high-value data will have to be taken into account.

  5. And what are the next objectives for the development of DCAT-AP-ES?

David Portolés: The main objective is to achieve full adoption by:

  • Data providers: changing the way they offer and disseminate the metadata of their datasets under this new paradigm.

  • Reusers: integrating the new profile into their developments, their systems and all the integrations they have built so far, so that they can create much better derivative products.

Manuel Ángel Jáñez: Also to maintain coherence with international standards such as DCAT-AP. We want to remain committed to an agile, participatory technical governance model, aligned with emerging trends such as protected data, sovereign data infrastructures and data spaces. In short: for DCAT-AP-ES to be useful, flexible and future-proof.

Interview clips

1. Why is it important to catalog data?

2. How can we describe data in open formats?

Entrevista

Collaborative culture and citizen open data projects are key to democratic access to information. They contribute to free knowledge, which helps to promote innovation and empower citizens.

In this new episode of the datos.gob.es podcast, we are joined by two professionals linked to citizen projects that have revolutionized the way we access, create and reuse knowledge. We welcome:

  • Florencia Claes, professor and coordinator of Free Culture at the Rey Juan Carlos University, and former president of Wikimedia Spain.
  • Miguel Sevilla-Callejo, researcher at the CSIC (Spanish National Research Council) and Vice-President of the OpenStreetMap Spain association.

Listen to the episode (in Spanish)

  1. How would you define free culture?

Florencia Claes: It is any cultural, scientific, intellectual expression, etc. that as authors we allow any other person to use, take advantage of, reuse, intervene in and relaunch into society, so that another person does the same with that material.

In free culture, licenses come into play, those permissions of use that tell us what we can do with those materials or with those expressions of free culture.

  2. What role do collaborative projects have within free culture?

Miguel Sevilla-Callejo: Having projects that are capable of bringing together these free culture initiatives is very important. Collaborative projects are horizontal initiatives in which anyone can contribute. A consensus is structured around them to make that project, that culture, grow.

  3. You are both linked to collaborative projects such as Wikimedia and OpenStreetMap. How do these projects impact society?

Florencia Claes: Clearly the world would not be the same without Wikipedia. We cannot conceive of a world without Wikipedia, without free access to information. I think Wikipedia is associated with the society we are in today. It has built what we are today, also as a society. The fact that it is a collaborative, open, free space means that anyone can join and contribute to it, and that it maintains a high level of rigour.

So, how does it impact? It means that (it will sound a little cheesy, but...) we can be better people, we can know more, we can have more information. It means that anyone with access to the internet can benefit from its content and learn, without necessarily having to go through a paywall or register on a platform and hand over data in order to access the information.

Miguel Sevilla-Callejo: We call OpenStreetMap the Wikipedia of maps, because a large part of its philosophy is copied, or cloned, from the philosophy of Wikipedia. In Wikipedia, people contribute encyclopaedic articles; what we do in OpenStreetMap is enter spatial data. We build a map collaboratively, and that means the openstreetmap.org page, which is where you could go to look at the maps, is just the tip of the iceberg. That is where OpenStreetMap is a little more diffuse and hidden, but most of the web pages, maps and spatial information that you see on the internet most likely come from the data of the great free, open and collaborative database that is OpenStreetMap.

Many times you are reading a newspaper and you see a map and that spatial data is taken from OpenStreetMap. They are even used in agencies: in the European Union, for example, OpenStreetMap is being used. It is used in information from private companies, public administrations, individuals, etc. And, in addition, being free, it is constantly reused.

I always like to bring up projects that we have done here, in the city of Zaragoza. We generated the entire urban pedestrian network, that is, all the pavements, the zebra crossings, the areas where you can walk... and with this you can calculate how to move around the city on foot (see the sketch below). You won't find this information about pavements, crossings and so on in commercial services, because it is not as lucrative as car navigation. And you can take advantage of it: for example, in some projects I supervised at university, we used it to study how mobility differs for blind people, wheelchair users or people pushing a pram.
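As a rough illustration of this kind of reuse, the sketch below downloads the walkable network of a city from OpenStreetMap and computes a pedestrian route between two points using the osmnx and networkx libraries. The coordinates are arbitrary examples and this is not the actual code of the Zaragoza project.

```python
import osmnx as ox
import networkx as nx

# Download the walkable street network of a city from OpenStreetMap.
G = ox.graph_from_place("Zaragoza, Spain", network_type="walk")

# Two arbitrary points (latitude, longitude), chosen only for illustration.
origin = (41.6561, -0.8773)
destination = (41.6488, -0.8891)

# Snap both points to the nearest graph nodes and compute the shortest path on foot.
orig_node = ox.distance.nearest_nodes(G, X=origin[1], Y=origin[0])
dest_node = ox.distance.nearest_nodes(G, X=destination[1], Y=destination[0])
route = nx.shortest_path(G, orig_node, dest_node, weight="length")

print(f"Pedestrian route with {len(route)} nodes")
```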

  4. You are telling us that these projects are open. If a citizen is listening to us right now and wants to take part, what should they do? How can you become part of these communities?

Florencia Claes: The interesting thing about these communities is that you don't need to be formally associated or linked to them in order to contribute. In Wikipedia you simply go to the page and, whether or not you create a user account, you can edit. What difference does having a username make? It gives you better access to the record of the contributions you have made, but you do not need to be associated or registered anywhere to edit Wikipedia.

There are groups at the local or regional level linked to the Wikimedia Foundation that receive grants and funding to hold meetings or activities. That's good, because you meet people with the same concerns, who are usually very enthusiastic about free knowledge. As my friends say, we are a bunch of geeks who have found each other and feel that we have a group of belonging in which we share and plan how to change the world.

Miguel Sevilla-Callejo: In OpenStreetMap it is practically the same, that is, you can do it on your own. It is true that there is a bit of a difference with respect to Wikipedia. Besides the openstreetmap.org page, we have all the documentation on a wiki, wiki.OpenStreetMap.org; you can go there and find everything you need.

It is true that to edit in OpenStreetMap you do need a user to better track the changes that people make to the map. If it were anonymous there could be more of a problem, because it is not like the texts in Wikipedia. But as Florencia said, it's much better if you associate yourself with a community.

We have local groups in different places. One of the initiatives that we have recently reactivated is the OpenStreetMap Spain association, in which, as Florencia said, we are a group of people who like data and free tools, and there we share all our knowledge. A lot of people come up to us and say "hey, I just joined OpenStreetMap, I like this project, how can I do this? How can I do that?" And well, it's always much better to do it with other colleagues than to do it alone. But anyone can do it.

  5. What challenges have you encountered when implementing these collaborative projects and ensuring their sustainability over time? What are the main challenges, both technical and social, that you face?

Miguel Sevilla-Callejo: One of the problems we find in all these movements that are so horizontal and in which we have to seek consensus to know where to move forward, is that in the end it is relatively problematic to deal with a very diverse community. There is always friction, different points of view... I think this is the most problematic thing. What happens is that, deep down, as we are all moved by enthusiasm for the project, we end up reaching agreements that make the project grow, as can be seen in Wikimedia and OpenStreetMap themselves, which continue to grow and grow.

From a technical point of view, for some specific things you need a certain amount of computer skill, but the requirements are very, very basic. For example, we have run mapathons, which consist of meeting up in a room with computers and starting to add spatial information for areas where, for example, there has been a natural disaster or something like that. Basically, on a satellite image, people place little houses where they see them, little houses there in the middle of the Sahel, for example, to help NGOs such as Doctors Without Borders. That's very easy: you open the browser, open OpenStreetMap and right away, with a few pointers, you're able to edit and contribute.

It is true that, if you want to do things that are a little more complex, you need more computer skills. So we always adapt. There are people who enter data in a very pro way, including buildings, importing data from the cadastre... and there are people like a girl here in Zaragoza who recently discovered the project and is entering the data she finds with an application on her mobile phone.

I do find a certain gender bias in the project, and within OpenStreetMap that worries me a little, because a large majority of the people editing, and of the community, are men, and in the end that does mean that some data carries a certain bias. But hey, we're working on it.

Florencia Claes: In that sense, in the Wikimedia environment, that also happens to us. We have, more or less worldwide, 20% of women participating in the project against 80% of men and that means that, for example, in the case of Wikipedia, there is a preference for articles about footballers sometimes. It is not a preference, but simply that the people who edit have those interests and as they are more men, we have more footballers, and we miss articles related, for example, to women's health.

So we do face biases and we face that coordination of the community. Sometimes people with many years participate, new people... and achieving a balance is very important and very difficult. But the interesting thing is when we manage to keep in mind or remember that the project is above us, that we are building something, that we are giving something away, that we are participating in something very big. When we become aware of that again, the differences calm down and we focus again on the common good which, after all, I believe is the goal of these two projects, both in the Wikimedia environment and OpenStreetMap.

  6. As you mentioned, both Wikimedia and OpenStreetMap are projects built by volunteers. How do you ensure data quality and accuracy?

Miguel Sevilla-Callejo: The interesting thing about all this is that the community is very large and there are many eyes watching. When there is a lack of rigor in the information, both in Wikipedia – which people know more about – but also in OpenStreetMap, alarm bells go off. We have tracking systems and it's relatively easy to see dysfunctions in the data. Then we can act quickly. This gives a capacity, in OpenStreetMap in particular, to react and update the data practically immediately and to solve those problems that may arise also quite quickly. It is true that there has to be a person attentive to that place or that area.

I've always liked to describe OpenStreetMap data as a kind of beta map, borrowing the term from software: it has the very latest information, but there can be some minor errors. So, as a highly up-to-date, high-quality map, it can be used for many things, but of course not for others, because there is also reference cartography being built by the public administration.

Florencia Claes: In the Wikimedia environment we also work like this, because of the sheer number of eyes looking at what we and others do. Within this community, everyone takes on roles. There are formally assigned roles, such as administrators or librarians, but there are others that arise spontaneously: I like to patrol, so I keep an eye on new articles, checking the ones published each day to see whether they need any support or improvement or whether, on the contrary, they are so poor that they need to be taken out of the main namespace or deleted.

The key to these projects is the number of people who participate, and everything is voluntary and altruistic. The passion is very high, the level of commitment is very high, so people take great care over these things. Whether data is curated and uploaded to Wikidata or an article is written on Wikipedia, each person does it with great affection and great care. Then, as time goes by, they keep an eye on the material they uploaded, to see how it has grown, whether it has been used, whether it has become richer or whether, on the contrary, something has been deleted.

Miguel Sevilla-Callejo: Regarding the quality of the data, I find interesting, for example, a recent initiative by the Territorial Information System of Navarre. They have migrated all their data for planning and guiding emergency routes to OpenStreetMap, building on what was already there. They got involved in the project and improved the information, considering that it was of high quality and much more useful to them than other alternatives, which shows the quality and importance this project can have.

  7. This data can also be used to generate open educational resources, along with other sources of knowledge. What do these resources consist of and what role do they play in the democratization of knowledge?

Florencia Claes: OER, open educational resources, should be the norm. Every teacher who generates content should make it available to citizens, and that content should be built in modules from free resources. That would be ideal.

What role does the Wikimedia environment play in this? It ranges from hosting information that can be used when building resources, to providing spaces to carry out exercises or to take data and work with it using SPARQL, for example (see the sketch below). In other words, there are different ways of approaching Wikimedia projects in relation to open educational resources. You can step in and teach students how to identify data, how to verify sources, or simply how to read critically how information is presented and curated, and compare, for example, how a topic is covered across languages.
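As a small, hypothetical illustration of the kind of SPARQL exercise mentioned here, the snippet below queries Wikidata's public endpoint with the SPARQLWrapper library to list a few national parks located in Spain. The entity and property identifiers used (Q46169, Q29, P31, P17) are the usual Wikidata ones, but they should be checked before relying on the query.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Wikidata's public SPARQL endpoint (a descriptive user agent is good practice).
endpoint = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="open-data-example/0.1 (educational sketch)",
)
endpoint.setReturnFormat(JSON)

# National parks (Q46169) whose country (P17) is Spain (Q29).
endpoint.setQuery("""
SELECT ?park ?parkLabel WHERE {
  ?park wdt:P31 wd:Q46169 ;
        wdt:P17 wd:Q29 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en". }
}
LIMIT 10
""")

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["parkLabel"]["value"])
```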

Miguel Sevilla-Callejo: OpenStreetMap is very similar. What is distinctive is the nature of the data. It is not information in different formats, as in Wikimedia; here the information is that free spatial database that is OpenStreetMap. So the limit is your imagination.

I remember that there was a colleague who went to some conferences and made a cake with the OpenStreetMap map. He would feed it to the people and say, "See? These are maps that we have been able to eat because they are free." To make more serious or more informal or playful cartography, the limit is only your imagination. It happens exactly the same as with Wikipedia.

  8. Finally, how can citizens and organizations be motivated to participate in the creation and maintenance of collaborative projects linked to free culture and open data?

Florencia Claes: I think we clearly have to do what Miguel said about the cake: you have to make a cake and invite people to eat it. Speaking seriously about what we can do to motivate citizens to reuse this data, I believe, from personal experience and from the groups I have worked with on these platforms, that having a friendly interface is a very important step.

In Wikipedia, the visual editor was activated in 2015. The visual editor brought many more women in to edit Wikipedia. Before, you could only edit in code, and code can at first glance seem hostile or distant, the sort of thing that makes you say "that's not for me". So, having interfaces where people don't need much prior knowledge to see that this is a package with certain data, that they can read it with a certain program or load it into a certain tool, making it simple, friendly, attractive... I think that will remove many barriers and set aside the idea that data is only for computer scientists. Data goes further than that: we can really take advantage of it in very different ways. So I think that is one of the barriers we should overcome.

Miguel Sevilla-Callejo: Something similar happened to us: until about 2015 (forgive me if the date is not exact), we had an interface that was quite horrible, almost like the code editing you have in Wikipedia, or worse, because you had to enter the data knowing the tagging scheme and so on. It was very complex. Now we have an editor where, basically, you are on OpenStreetMap, you hit edit and a very simple interface comes up. You don't even have to enter tags in English anymore; it's all translated. Many things are pre-configured, and people can enter data immediately and in a very simple way. What that has allowed is that many more people come to the project.

Another very interesting thing, which also happens in Wikipedia, although there it is much more focused on the web interface, is that an ecosystem of applications and services has grown up around OpenStreetMap. This has made it possible, for example, for mobile applications to appear that allow data to be entered directly in the field, very quickly and very simply. And that makes it possible for people to contribute data easily.

I wanted to stress it again, even though I know we keep coming back to the same point, because I think it is important: we need people to be aware that this data is free, that it belongs to the community, that it is not in the hands of a private company, that it can be modified and transformed, that behind it there is a community of volunteers, and that none of this detracts from the quality of the data, which reaches everywhere. We need people to come closer and not see us as oddballs. Wikipedia is much more integrated into society's knowledge, and now with artificial intelligence even more so, but in OpenStreetMap people look at you as if to say "what are you telling me, I use another application on my phone", when in fact they are using OpenStreetMap data without knowing it. So we need to get closer to society, so that people get to know us better.

Returning to the issue of the association, that is one of our objectives: that people know us, that they know this data is open, that it can be transformed, that they can use it and that they are free to build with it, as I said before, whatever they want; the limit is their imagination.

Florencia Claes: I think we should use gamification, games in the classroom, to bring maps and data into day-to-day schooling. I think we would score a point in our favour there. And since we are within a free ecosystem, we can integrate visualisation and reuse tools directly into the pages of the data repositories, which I think would make everything much friendlier and give citizens a certain power; it would empower them in a way that encourages them to use the data.

Miguel Sevilla-Callejo: It's interesting that there are things connecting both projects (we in OpenStreetMap and Wikipedia sometimes forget each other): there is data we can exchange, coordinate and combine. And that would also add to what you have just said.

Subscribe to our Spotify profile to keep up to date with our podcasts

 

Interview clips

 

1. What is OpenStreetMap?

 

 

2. How does Wikimedia help in the creation of Open Educational Resources?

 

Entrevista

Open knowledge is knowledge that can be reused, shared and improved by other users and researchers without noticeable restrictions. This includes data, academic publications, software and other available resources. To explore this topic in more depth, we have representatives from two institutions whose aim is to promote scientific production and make it available in open access for reuse:

  • Mireia Alcalá Ponce de León, Information Resources Technician of the Learning, Research and Open Science Area of the Consortium of University Services of Catalonia (CSUC).
  • Juan Corrales Corrillero, Manager of the data repository of the Madroño Consortium.

 

Listen to the podcast here (in Spanish)

 

Summary of the interview

1. Can you briefly explain what the institutions you work for do?

Mireia Alcalá: The CSUC is the Consortium of University Services of Catalonia and is an organisation that aims to help universities and research centres located in Catalonia to improve their efficiency through collaborative projects. We are talking about some 12 universities and almost 50 research centres.
We offer services in many areas: scientific computing, e-government, repositories, cloud administration, etc., and we also offer library and open science services, which is the area closest to us. In the area of learning, research and open science, which is where I work, we try to facilitate the adoption of new methodologies by the university and research system, especially in open science, and we provide support for research data management.

Juan Corrales: The Consorcio Madroño is a consortium of university libraries of the Community of Madrid and the UNED (National University of Distance Education) for library cooperation. We seek to increase the scientific output of the universities that are part of the consortium and also to increase collaboration between the libraries in other areas. Like CSUC, we are also very involved in open science: in promoting it and in providing infrastructures that facilitate it, not only for the members of the Consorcio Madroño, but also globally. Apart from that, we also provide other library services and create structures for them.

2. What are the requirements for research to be considered open?

Juan Corrales: For research to be considered open there are many definitions, but perhaps one of the most important is given by the National Open Science Strategy, which has six pillars.

One of them is that both research data and publications, protocols, methodologies and so on must be made openly accessible. In other words, everything must be accessible and, in principle, without barriers for everyone, not only for scientists or for universities that can pay for access to these research data or publications. It is also important to use open source platforms that we can customise. Open source is software that anyone, in principle with the right knowledge, can modify, customise and redistribute, in contrast to the proprietary software of many companies, which does not allow all of this. Another important point, although it is still far from being achieved in most institutions, is allowing open peer review, because it lets us know who has done a review, with what comments, and so on. It can be said that it allows the peer review cycle to be redone and improved. A further point is citizen science: allowing ordinary citizens to be part of science, not only within universities or research institutes. And one last important point is adding new ways of measuring the quality of science.

Mireia Alcalá: I agree with what Juan says. I would also add that, for a research process to be considered open, we have to look at it globally, that is, including the entire data lifecycle. We cannot talk about open science if we only look at whether the data is open at the end. From the very beginning of the data lifecycle, it is important to use platforms and to work in a more open and collaborative way.

3. Why is it important for universities and research centres to make their studies and data available to the public?

Mireia Alcalá: I think it is key that universities and centres share their studies, because a large part of research, both here in Spain and at European and world level, is funded with public money. Therefore, if society is paying for the research, it is only logical that it should also benefit from its results. In addition, opening up the research process can help make it more transparent and more accountable. Much of the research done to date has been found to be neither reusable nor reproducible: in almost 80% of cases, someone else cannot take the data from a study and reuse it. Why? Because they do not follow the same standards, the same methods, and so on. So, I think we have to extend this everywhere, and a clear example is in times of pandemic. With COVID-19, researchers from all over the world worked together, sharing data and findings in real time, working in the same way, and science was seen to be much faster and more efficient.


Juan Corrales: The key points have already been touched upon by Mireia. Besides, it could be added that bringing science closer to society can make all citizens feel that science is something that belongs to us, not just to scientists or academics. It is something we can participate in, and this can also help to stop hoaxes and fake news, to take a more thorough view of the news that reaches us through social networks and to filter out what may be real and what may be false.

4. What research should be published openly?

Juan Corrales: Right now, according to the law we have in Spain, the latest Law of Science, all publications that are mainly financed by public funds, or in which public institutions participate, must be published in open access. This did not really have much impact until last year: although the law came out two years ago, the previous law already said the same, and there is also a law of the Community of Madrid along the same lines. But since last year it has been taken into account in the evaluation that ANECA (the quality evaluation agency) carries out on researchers. Since then, almost all researchers have made it a priority to publish their data and research openly. Data, above all, was something that had not been published until now.

Mireia Alcalá: At the state level it is as Juan says. At the regional level we also have a law from 2022, the Law of Science, which basically says the same as the Spanish law. But I also like people to know that we have to take into account not only the legislation, but also the calls for proposals that fund the projects. In Europe, in framework programmes such as Horizon Europe, it is clearly stated that, if you receive funding from the European Commission, you will have to draw up a data management plan at the beginning of your research and publish the data following the FAIR principles.

 

5. Among other things, both CSUC and Consorcio Madroño are in charge of supporting institutions and researchers who want to make their data available to the public. What should a process of opening up research data look like? What are the most common challenges and how do you solve them?


Mireia Alcalá: In our repository, which is called RDR (from Repositori de Dades de Recerca), it is basically the participating institutions that are in charge of supporting the research staff. Researchers often arrive at the repository when they are already in the final phase of the research and need to publish the data "yesterday", and then everything is much more complex and time-consuming: it takes longer to verify the data and make it findable, accessible, interoperable and reusable.
In our particular case, we have a checklist that we require every dataset to comply with in order to ensure a minimum level of quality, so that it can be reused (a simplified sketch of such a checklist appears after this answer). We are talking about having persistent identifiers such as ORCID for the researcher or ROR to identify the institutions, having documentation explaining how to reuse the data, having a licence, and so on. Because we have this checklist, researchers improve their processes as they deposit, and start to work on and improve the quality of the data from the beginning. It is a slow process. The main challenge, I think, is getting researchers to realise that what they have is data, because most of them don't know it. Most researchers think of data as numbers from a machine that measures air quality, and are unaware that data can also be a photograph, footage from an archaeological excavation, a sound captured in a certain atmosphere, and so on. Therefore, the main challenge is for everyone to understand what data is and that their data can be valuable to others.
And how do we solve it? Trying to do a lot of training, a lot of awareness raising. In recent years, the Consortium has worked to train data curation staff, who are dedicated to helping researchers directly refine this data. We are also starting to raise awareness directly with researchers so that they use the tools and understand this new paradigm of data management.
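As a purely illustrative sketch of this kind of pre-deposit checklist, the following Python snippet validates that a dataset record carries a minimum set of metadata before it is accepted. The field names and rules are hypothetical and much simpler than the actual checklist used by CSUC.

```python
import re

# Hypothetical minimum requirements, loosely inspired by the checklist described above.
REQUIRED_FIELDS = ["title", "description", "license", "author_orcid", "institution_ror", "readme"]

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
ROR_PATTERN = re.compile(r"^https://ror\.org/0[a-z0-9]{8}$")

def check_dataset(record: dict) -> list[str]:
    """Return a list of problems found in a dataset record (an empty list means it passes)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    if record.get("author_orcid") and not ORCID_PATTERN.match(record["author_orcid"]):
        problems.append("author_orcid does not look like a valid ORCID iD")
    if record.get("institution_ror") and not ROR_PATTERN.match(record["institution_ror"]):
        problems.append("institution_ror does not look like a valid ROR identifier")
    return problems

# Invented example record that fails the check because the licence is missing.
example = {
    "title": "Air quality measurements 2020-2023",
    "description": "Hourly measurements from urban stations.",
    "author_orcid": "0000-0002-1825-0097",      # example identifier, not a real author
    "institution_ror": "https://ror.org/05b3rxs33",  # invented ROR-style identifier
    "readme": "README.md",
}
print(check_dataset(example))
```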

Juan Corrales: In the Madroño Consortium, until November, the only way to open data was for researchers to pass a form with the data and its metadata to the librarians, and it was the librarians who uploaded it to ensure that it was FAIR. Since November, we also allow researchers to upload data directly to the repository, but it is not published until it has been reviewed by expert librarians, who verify that the data and metadata are of high quality. It is very important that the data is well described so that it can be easily found, reusable and identifiable.

As for the challenges, there are all those mentioned by Mireia, such as researchers often not knowing that they have data. Also, although ANECA has helped a lot with the new obligations to publish research data, many researchers want to dump their data into the repositories in a hurry, without taking into account that it has to be quality data: it is not enough just to put it there, it is important that the data can be reused later.

6. What activities and tools do you or similar institutions provide to help organisations succeed in this task?

Juan Corrales: From Consorcio Madroño, the repository itself that we use, the tool where the research data is uploaded, makes it easy to make the data FAIR, because it already provides unique identifiers, fairly comprehensive metadata templates that can be customised, and so on. We also have another tool that helps create the data management plans for researchers, so that before they create their research data, they start planning how they're going to work with it. This is very important and has been promoted by European institutions for a long time, as well as by the Science Act and the National Open Science Strategy.
Beyond the tools, the review by expert librarians is also very important. There are other tools that help assess the quality of a dataset, of research data, such as FAIR EVA or F-UJI, but what we have found is that, in the end, what they mostly evaluate is the quality of the repository, of the software being used and of the requirements you impose on researchers when they upload their metadata, because all our datasets obtain a fairly high and quite similar score. So what those tools really help us with is to improve both the requirements we place on our datasets and the tools we have, in this case the Dataverse software, which is the one we are using.


Mireia Alcalá: In terms of tools and activities we are on a par, because we have had a relationship with the Consorcio Madroño for years, and just like them we have all these tools that help and facilitate putting the data in the best possible shape right from the start, for example the tool for drawing up data management plans. Here at CSUC we have also been working very intensively in recent years to close the gaps in the data lifecycle, covering infrastructure, storage, cloud and so on, so that researchers also have somewhere to go when the data is being analysed and managed. After the repository, we move on to all the channels and portals that make it possible to disseminate and make all this science visible, because it makes no sense to build repositories that sit in a silo; they have to be interconnected. For many years now, a lot of work has been done on interoperability protocols and on following the same standards. Therefore, the data has to be findable elsewhere too, and both the Consorcio Madroño and we are present in as many places as possible.



7. Can you tell us a bit more about these repositories you offer? In addition to helping researchers to make their data available to the public, you also offer a space, a digital repository where this data can be housed, so that it can be located by users.
 

Mireia Alcalá: If we are talking specifically about research data, since we and the Consorcio Madroño have the same repository software, I will let Juan explain the software and its specifications, and I will focus on the other repositories of scientific output that CSUC also offers. What we do is coordinate different cooperative repositories according to the type of resource they contain. So, we have TDX for theses, RECERCAT for research papers, RACO for scientific journals and MACO for open access monographs. Depending on the type of output, we have a specific repository, because not everything can live in the same place, as each research output has its own particularities. Apart from these cooperative repositories, we also build other spaces for specific institutions, either with a more standard solution or with more customised functionalities. But basically that is it: for each type of research output, there is a specific repository adapted to the particularities of each format.


Juan Corrales: In the case of the Consorcio Madroño, our repository is called e-scienceData, but it is based on the same software as the CSUC repository, which is Dataverse. It is open source software, so it can be improved and customised. Although in principle its development is managed from Harvard University in the United States, institutions from all over the world participate in it; I believe thirty-odd countries have already contributed to its development.
Among other things, for example, the translations into Catalan have been done by CSUC, the translation into Spanish has been done by the Consorcio Madroño, and we have also taken part in other small developments. The advantage of this software is that it makes it much easier for the data to be FAIR and to be exposed through other portals with much more visibility (a small example of querying its search interface follows below). CSUC is much larger, but the Consorcio Madroño comprises six universities, and it is rare for someone to look for a dataset directly in e-scienceData; they usually find it via Google or through a European or international portal. With the facilities Dataverse provides, they can search from anywhere and end up finding the data we hold at the Consorcio Madroño or at CSUC.
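To illustrate how datasets in a Dataverse-based repository can be found programmatically, here is a small sketch that calls the standard Dataverse search API with Python's requests library. The base URL points at the Dataverse project's public demo instance and the query term is just an example; swap in the URL of the installation you are actually interested in.

```python
import requests

# Any Dataverse installation exposes the same search API under /api/search.
# demo.dataverse.org is the project's public demo; replace it as needed.
BASE_URL = "https://demo.dataverse.org"

params = {"q": "air quality", "type": "dataset", "per_page": 5}
response = requests.get(f"{BASE_URL}/api/search", params=params, timeout=30)
response.raise_for_status()

for item in response.json()["data"]["items"]:
    # Each hit carries, among other fields, a title and a persistent identifier.
    print(item.get("name"), "-", item.get("global_id"))
```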

 

8. What other platforms with open research data, at Spanish or European level, do you recommend?


Juan Corrales: For example, at the Spanish level there is FECYT, the Spanish Foundation for Science and Technology, which has a harvester that collects the research outputs of practically all Spanish institutions. The publications of all those institutions appear there: Consorcio Madroño, CSUC and many more.
Then, specifically for research data, a lot of research should go into a thematic repository, because that is where researchers in that branch of science will look for it. We have a tool to help choose the right thematic repository. At the European level there is Zenodo, which has a lot of visibility but does not have the data quality support that CSUC or the Consorcio Madroño provide, and that is very noticeable when it comes to reuse later on.


Mireia Alcalá: At the national level, apart from the Consorcio Madroño's initiatives and our own, data repositories are not yet widespread. We are aware of some initiatives under development, but it is still too early to see their results. However, I do know of some universities that have adapted their institutional repositories so that they can also hold data. And while this is a valid solution for those who have no other choice, it has been found that repository software not designed to handle the particularities of data, such as heterogeneity, format diversity and large size, falls a bit short. Then, as Juan said, at the European level Zenodo is the established multidisciplinary and multi-format repository, born out of a European Commission project. I agree with him that, as it is a self-archiving and self-publishing repository (that is, I, Mireia Alcalá, can go there, spend five minutes, deposit any document I have, nobody looks at it, I fill in the minimum metadata they ask for and I publish it), the quality is very variable. There are some things that are really usable and perfect, but there are others that need a little more care. As Juan said, at the disciplinary level it is also worth highlighting that, in all those areas that have a disciplinary repository, researchers should go there, because that is where they will be able to use the most appropriate metadata, where everybody works in the same way and where everybody knows where to look for that data. For anyone who is interested, there is a directory called re3data, which is basically a directory of all these multidisciplinary and disciplinary repositories. It is a good resource for anyone who does not know what exists in their discipline.

9. What actions do you consider to be priorities for public institutions in order to promote open knowledge?


Mireia Alcalá: What I would basically say is that public institutions should focus on establishing clear policies on open science, because it is true that we have come a long way in recent years, but there are times when researchers are a bit bewildered. And apart from policies, it is above all a matter of offering incentives to the entire research community, because many people are making the effort to change the way they work in order to embrace open science, and sometimes they do not see how all that extra effort pays off. So I would say this: policies and incentives.


Juan Corrales: From my point of view, the policies we already have on paper, at the national and regional level, are usually quite correct, quite good. The problem is that often no real attempt has been made to enforce them. Until what we have seen recently with ANECA, which has promoted the use of repositories for research data and articles, they had not really started to be used on a large scale. In other words, incentives are necessary; it cannot be just a matter of obligation. As Mireia has also said, we have to convince researchers to see open publishing as their own, as something that benefits both them and society as a whole. What I think is most important is that: the awareness of researchers.

Subscribe to our Spotify profile

Interview clips

1. Why should universities and researchers share their studies in open formats?

2. What requirements must research meet in order to be considered open?

Entrevista

Did you know that data science skills are among the most in-demand skills in business? In this podcast, we are going to tell you how you can train yourself in this field, in a self-taught way. For this purpose, we will have two experts in data science:

  • Juan Benavente, industrial and computer engineer with more than 12 years of experience in technological innovation and digital transformation. He has also been training new professionals in technology schools, business schools and universities for years.
  • Alejandro Alija, PhD in physics, data scientist and expert in digital transformation. In addition to his extensive professional experience focused on the Internet of Things (IoT), Alejandro also works as a lecturer in different business schools and universities.

 

Listen to the podcast (in Spanish)

Summary of the interview

  1. What is data science? Why is it important and what can it do for us? 

Alejandro Alija: Data science could be defined as a discipline whose main objective is to understand the world, the processes of business and life, by analysing and observing data. In the last 20 years it has gained exceptional relevance due to the explosion in data generation, driven mainly by the irruption of the internet and the connected world.

Juan Benavente:  The term data science has evolved since its inception. Today, a data scientist is the person who is working at the highest level in data analysis, often associated with the building of machine learning or artificial intelligence algorithms for specific companies or sectors, such as predicting or optimising manufacturing in a plant.

The profession is evolving rapidly, and is likely to fragment in the coming years. We have seen the emergence of new roles such as data engineers or MLOps specialists. The important thing is that today any professional, regardless of their field, needs to work with data. There is no doubt that any position or company requires increasingly advanced data analysis. It doesn't matter if you are in marketing, sales, operations or at university. Anyone today is working with, manipulating and analysing data. If we also aspire to data science, which would be the highest level of expertise, we will be in a very beneficial position. But I would definitely recommend any professional to keep this on their radar.

  2. How did you get started in data science and what do you do to keep up to date? What strategies would you recommend for both beginners and more experienced profiles?

Alejandro Alija: My basic background is in physics, and I did my PhD in basic science. In fact, it could be said that any scientist, by definition, is a data scientist, because science is based on formulating hypotheses and proving them with experiments and theories. My relationship with data started early in academia. A turning point in my career was when I started working in the private sector, specifically in an environmental management company that measures and monitors air pollution. The environment is a field that is traditionally a major generator of data, especially as it is a regulated sector where administrations and private companies are obliged, for example, to record air pollution levels under certain conditions. I found historical series up to 20 years old that were available for me to analyse. From there my curiosity began and I specialised in concrete tools to analyse and understand what is happening in the world.

Juan Benavente: I can identify with what Alejandro said because I am not a computer scientist either. I trained in industrial engineering and, although computer science is one of my interests, it was not my base. Nowadays, in contrast, I do see more specialists being trained at the university level. A data scientist today carries many skills, such as statistics, mathematics and the ability to understand everything that goes on in the industry. I have been acquiring this knowledge through practice. On how to keep up to date, I think that, in many cases, you can stay in contact with companies that are innovating in this field. A lot can also be learned at industry or technology events. I started in the smart city world and have moved on to the industrial world, learning little by little.

Alejandro Alija: To add another source for keeping up to date, apart from what Juan has said, I think it is important to identify what we might call the outsiders: the technology manufacturers, the market players. They are a very useful source of information for staying up to date: identifying their future strategies and what they are betting on.

  3. If someone with little or no technical knowledge wants to learn data science, where do they start?

Juan Benavente: In training, I have come across very different profiles: from people who have just graduated from university to people trained in very different fields who find in data science an opportunity to reinvent themselves and dedicate themselves to this. Thinking of someone who is just starting out, I think the best thing to do is to put your knowledge into practice. In projects I have worked on, we defined the methodology in three phases: a first phase of more theoretical aspects, covering mathematics, programming and everything a data scientist needs to know; then, once you have those basics, the sooner you start working and practising those skills, the better. I believe practice sharpens the wits and, both to keep up to date and to train yourself and acquire useful knowledge, the sooner you get into a project, the better. All the more so in a world that is updated so frequently. In recent years, the emergence of generative AI has brought other opportunities. There are also opportunities for new profiles who want to be trained: even if you are not an expert in programming, there are tools that can help you with programming, and the same can happen in mathematics or statistics.

Alejandro Alija: To complement what Juan says from a different perspective, I think it is worth highlighting the evolution of the data science profession. I remember when that paper about "the sexiest profession in the world" became famous and went viral, but then things adjusted. The first settlers in the world of data science did not come so much from computer science or informatics. There were more outsiders: physicists and mathematicians, with a strong background in mathematics and physics, and even some engineers whose work and professional development meant that they ended up using many tools from the computer science field. Gradually, it has become more and more balanced. It is now a discipline that continues to have those two strands: people who come from the world of physics and mathematics towards data, and people who come with programming skills. Everyone knows what they have to balance in their toolbox. Thinking about a junior profile who is just starting out, I think one very important thing, and we see this when we teach, is programming skills. I would say that having programming skills is not just a plus, but a basic requirement for advancing in this profession. It is true that some people can do well without many programming skills, but I would argue that a beginner needs those first programming skills with a basic toolset. We are talking about languages such as Python and R, which are the headline languages. You don't need to be a great coder, but you do need some basic knowledge to get started. Then, of course, specific training in the mathematical foundations of data science is crucial. Fundamental statistics, and more advanced statistics, are complements that, if present, will move a person along the data science learning curve much faster. Thirdly, I would say that specialisation in particular tools is important. Some people are more oriented towards data engineering, others towards the modelling world. Ideally, specialise in a few frameworks and use them together, as optimally as possible.

  4. In addition to teaching, you both work in technology companies. What technical certifications are most valued in the business sector and what open sources of knowledge do you recommend to prepare for them?

Juan Benavente: Personally, it is not what I look at most, but I think it can be relevant, especially for people who are starting out and need help in structuring their approach to the problem and understanding it. I recommend certifications in technologies that are in use in any company where you might want to end up working, especially from providers of cloud computing and widespread data analytics tools. These are certifications I would recommend for someone who wants to approach this world and needs a structure to help them. When you don't have a knowledge base, it can be a bit confusing to understand where to start: perhaps you should reinforce programming or mathematical knowledge first, and it can all seem a bit complicated. Where these certifications certainly help, besides reinforcing concepts, is in making sure that you are progressing well and getting to know the typical ecosystem of tools you will be working with tomorrow. It is not just about theoretical concepts, but about knowing the ecosystems you will encounter when you start working, whether you are starting your own company or joining an established one. Call it Microsoft, Amazon or other providers of such solutions. This will allow you to focus more quickly on the work itself, and less on all the tools that surround it. I believe this type of certification is useful, especially for profiles approaching this world with enthusiasm: it will help them both to structure themselves and to land well in their professional destination. They are also likely to be valued in selection processes.

Alejandro Alija: If someone listens to us and wants more specific guidelines, it could be structured in blocks. There are a series of massive online courses that, for me, were a turning point. In my early days, I tried to enrol in several of these courses on platforms such as Coursera, edX, where even the technology manufacturers themselves design these courses. I believe that this kind of massive, self-service, online courses provide a good starting base. A second block would be the courses and certifications of the big technology providers, such as Microsoft, Amazon Web Services, Google and other platforms that are benchmarks in the world of data. These companies have the advantage that their learning paths are very well structured, which facilitates professional growth within their own ecosystems. Certifications from different suppliers can be combined. For a person who wants to go into this field, the path ranges from the simplest to the most advanced certifications, such as being a data solutions architect or a specialist in a specific data analytics service or product. These two learning blocks are available on the internet, most of them are open and free or close to free. Beyond knowledge, what is valued is certification, especially in companies looking for these professional profiles.

  5. In addition to theoretical training, practice is key, and one of the most interesting ways of learning is to replicate exercises step by step. In this sense, datos.gob.es offers didactic resources, many of them developed by you as experts in the project. Can you tell us what these exercises consist of? How are they approached?

Alejandro Alija: The approach we have always taken is designed for a broad audience, without complex prerequisites. We wanted any user of the portal to be able to replicate the exercises, although clearly the more knowledge you have, the more you can get out of them. The exercises have a well-defined structure: a documentary section, usually a content post or a report describing what the exercise consists of, what materials are needed, what the objectives are and what it is intended to achieve. In addition, we accompany each exercise with two additional resources. The first is a code repository where we upload the necessary materials, with a brief description and the code of the exercise; it can be a Python notebook, a Jupyter Notebook or a simple script, containing the technical content. The other fundamental element is aimed at making the exercises easy to run. In data science and programming, non-specialist users often find it difficult to set up a working environment: a Python exercise, for example, requires having a programming environment installed, knowing the necessary libraries and making configurations that are trivial for professionals but can be very complex for beginners. To mitigate this barrier, we publish most of our exercises on Google Colab, a wonderful and open tool. Google Colab is a web programming environment where the user only needs a browser. Basically, Google provides a virtual machine where we can run our programmes and exercises without any special configuration. The important thing is that the exercise is ready to use, and we always check it in this environment, which makes learning much easier for beginners or less technically experienced users.

Juan Benavente: Yes, we always take a user-oriented approach, step by step, trying to make it open and accessible. The aim is for anyone to be able to run an exercise without the need for complex configurations, focusing on topics as close to reality as possible. We often take advantage of open data published by bodies such as the DGT to make realistic analyses. We have developed very interesting exercises, such as energy market predictions or the analysis of critical materials for batteries and electronics, which allow you to learn not only about the technology but also about the specific subject matter (a minimal sketch of the typical pattern follows below). You can get down to work right away and, in doing so, both learn and find out about the subject.
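As a purely illustrative sketch of the typical structure of these exercises, the snippet below loads an open CSV dataset with pandas and makes a first quick exploration, the kind of opening step a Colab notebook would walk through. The URL and the 'year' column are placeholders, not an actual datos.gob.es distribution.

```python
import pandas as pd

# Placeholder URL: replace it with the CSV distribution of the open dataset you want to analyse.
DATA_URL = "https://example.org/open-data/traffic-accidents.csv"

# 1. Load the data (most open data portals publish CSV distributions that pandas can read directly).
df = pd.read_csv(DATA_URL)

# 2. Quick exploration: shape, column types and a few rows.
print(df.shape)
print(df.dtypes)
print(df.head())

# 3. A first aggregate, e.g. number of records per year, assuming a 'year' column exists.
if "year" in df.columns:
    print(df.groupby("year").size())
```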

  6. In closing, we'd like you to offer a piece of advice that is more attitude-oriented than technical. What would you say to someone starting out in data science?

Alejandro Alija: As for an attitude tip for someone starting out in data science, I suggest being brave. There is no need to worry about being unprepared, because in this field everything is still to be done and anyone can contribute value. Data science is multi-faceted: there are professionals closer to the business world who can provide valuable insights, and others who are more technical and need to understand the context of each area. My advice is to make do with the resources available and not panic because, although the path may seem complex, the opportunities are very great. As a technical tip, it is important to be sensitive to how data is produced and used: the better one understands this world, the smoother the approach to projects will be.

Juan Benavente: I endorse the advice to be brave and add a reflection on programming: many people find the theoretical concept attractive, but when they get to practice and see the complexity of programming, some are discouraged by lack of prior knowledge or different expectations. It is important to add the concepts of patience and perseverance. When you start in this field, you are faced with multiple areas that you need to master: programming, statistics, mathematics, and specific knowledge of the sector you will be working in, be it marketing, logistics or another field. The expectation of becoming an expert quickly is unrealistic. It is a profession that, although it can be started without fear and by collaborating with professionals, requires a journey and a learning process. You have to be consistent and patient, managing expectations appropriately. Most people who have been in this world for a long time agree that they have no regrets about going into data science. It is a very attractive profession where you can add significant value, with an important technological component. However, the path is not always straightforward. There will be complex projects, moments of frustration when analyses do not yield the expected results or when working with data proves more challenging than expected. But looking back, few professionals regret having invested time and effort in training and developing in this field. In summary, the key tips are: courage to start, perseverance in learning and development of programming skills.

Interview clips

1. Is it worth studying data science?

2. How are the data science exercises on datos.gob.es approached?

3. What is data science? What skills are required?
