Do you know why it is so important to categorize datasets? Do you know the reference frameworks that exist to do it according to global, European and national standards? In this podcast we tell you the keys to categorizing datasets and guide you through doing it in your organization.
- David Portolés, Project Manager of the Advisory Service
- Manuel Ángel Jáñez, Senior Data Expert
Listen to the full podcast (Spanish)
Summary of the interview
-
What do we mean when we talk about cataloguing data and why is it so important to do so?
David Portolés: When we talk about cataloguing data, what we want is to describe it in a structured way. In other words, we talk about metadata: information related to data. Why is it so important? Because thanks to this metadata, interoperability is achieved. This word may sound complicated, but it simply means that systems can communicate with each other autonomously.
Manuel Ángel Jáñez: Exactly, as David says, categorizing is not just labeling. It is about providing data with properties that make it understandable, accessible and reusable. For that we need agreements or standards. If each producer defines their own rules, consumers will not be able to interpret them correctly, and value is lost. Categorizing means reaching a consensus between the general and the specific, and this is not new: it is an evolution of library documentation, adapted to the digital environment.
-
So we understand that interoperability means speaking the same language in order to get the most out of data. What references are there at global, European and national level?
Manuel Ángel Jáñez: Data should be described in an open way, using standards or reference specifications and frameworks.
-
Globally: DCAT (a W3C recommendation) allows you to model catalogs, datasets, distributions, services, etc. In essence, all the key entities that are then reused in the rest of the profiles.
-
At the European level: DCAT-AP, the application profile for data portals in the European Union, particularly those of the public sector. This is essentially what the Spanish profile, DCAT-AP-ES, builds on.
-
In Spain: DCAT-AP-ES is the profile that incorporates more specific restrictions at the Spanish level. It is based on the 2013 Technical Standard for Interoperability (NTI). This profile adds new features, evolves the model to make it compatible with the European standard, incorporates features related to high-value datasets (HVDs), and adapts the standard to the current data ecosystem.
David Portolés: With a good description, reusers can search for, retrieve and locate the datasets that interest them and, at the same time, discover new datasets they had not considered. Standards, models and shared vocabularies make that possible. The main difference between them is the degree of detail they apply. The key is to strike a balance between being as general as possible, so as not to be restrictive, and being specific enough where it is needed. Although we talk a lot about open data, these standards also apply to protected data that can be described. Their universe of application is very broad.
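To make this more tangible, below is a minimal, illustrative sketch (not official DCAT-AP-ES tooling, and far from a complete profile-compliant record) of how a dataset and its distribution can be described with the DCAT vocabulary using Python's rdflib library. All identifiers and values are hypothetical.

```python
# Illustrative sketch: describing a dataset with DCAT metadata using rdflib.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

# Hypothetical identifiers, used only for illustration.
dataset = URIRef("https://example.org/catalogo/dataset/calidad-aire")
distribution = URIRef("https://example.org/catalogo/dataset/calidad-aire/csv")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Air quality measurements", lang="en")))
g.add((dataset, DCTERMS.publisher, URIRef("https://example.org/org/city-council")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("https://example.org/files/calidad-aire.csv")))
g.add((distribution, DCTERMS.format, Literal("text/csv")))

# Serialize the description as Turtle, a common exchange format for catalogues.
print(g.serialize(format="turtle"))
```

Because the description is expressed with a shared vocabulary, any catalogue or harvester that understands DCAT can interpret it, which is precisely the interoperability discussed above.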
-
Focusing on DCAT-AP-ES, what help or resources are there for a user to implement it?
David Portolés: DCAT-AP-ES is a set of rules and basic application models. Like any technical standard, it has an application guide and, in addition, there is an online implementation guide with examples, conventions, frequently asked questions and spaces for technical and informative discussion. This guide has a very clear purpose: to create a community around the technical standard and to generate a knowledge base accessible to all, with a transparent and open support channel for anyone who wants to participate.
Manuel Ángel Jáñez: The available resources do not start from zero. Everything is aligned with European initiatives such as SEMIC, which promotes semantic interoperability in the EU. We want a living and dynamic tool that evolves with needs, under a participatory approach, with good practices, debates, harmonisation of the profile, etc. In short, the aim is for the model to be useful, robust, easy to maintain over time and flexible enough for anyone to participate in its improvement.
-
Is there any existing thematic implementation in DCAT-AP-ES?
Manuel Ángel Jáñez: Yes, important steps have been taken in that direction. For example, the model for high-value datasets has already been included, which is key for data relevant to the economy or society and useful for AI, for example. DCAT-AP-ES draws on profiles such as DCAT-AP v2.1.1 (2022), which incorporates some semantic improvements, but there are still thematic implementations to be incorporated into DCAT-AP-ES, such as data series. The idea is that thematic extensions will enable modelling of specific types of datasets.
David Portolés: As Manu says, the idea is that it is a living model. Possible future extensions are:
- Geographical data: GeoDCAT-AP (European).
- Statistical data: StatDCAT-AP.
In addition, future directives on high-value data will have to be taken into account.
-
And what are the next objectives for the development of DCAT-AP-ES?
David Portolés: The main objective is to achieve full adoption by:
-
Vendors: to change the way they offer and disseminate the metadata describing their datasets, under this new paradigm.
-
Reusers: to integrate the new profile into their developments, their systems and all the integrations they have built so far, so that they can create much better derivative products.
Manuel Ángel Jáñez: Another objective is to maintain coherence with international standards such as DCAT-AP. We want to remain committed to an agile, participatory technical governance model, aligned with emerging trends such as protected data, sovereign data infrastructures and data spaces. In short: we want DCAT-AP-ES to be useful, flexible and future-proof.
Collaborative culture and citizen open data projects are key to democratic access to information. They contribute to free knowledge, which helps promote innovation and empower citizens.
In this new episode of the datos.gob.es podcast, we are joined by two professionals linked to citizen projects that have revolutionized the way we access, create and reuse knowledge. We welcome:
- Florencia Claes, professor and coordinator of Free Culture at the Rey Juan Carlos University, and former president of Wikimedia Spain.
- Miguel Sevilla-Callejo, researcher at the CSIC (Spanish National Research Council) and Vice-President of the OpenStreetMap Spain association.
Listen to the episode (in Spanish)
-
How would you define free culture?
Florencia Claes: It is any cultural, scientific or intellectual expression that we, as authors, allow other people to use, take advantage of, reuse, modify and relaunch into society, so that others can do the same with that material.
In free culture, licenses come into play, those permissions of use that tell us what we can do with those materials or with those expressions of free culture.
-
What role do collaborative projects have within free culture?
Miguel Sevilla-Callejo: Having projects that are capable of bringing together these free culture initiatives is very important. Collaborative projects are horizontal initiatives in which anyone can contribute. A consensus is structured around them to make that project, that culture, grow.
-
You are both linked to collaborative projects such as Wikimedia and OpenStreetMap. How do these projects impact society?
Florencia Claes: Clearly the world would not be the same without Wikipedia. We cannot conceive of a world without Wikipedia, without free access to information. I think Wikipedia is bound up with the society we live in today. It has built what we are today, also as a society. The fact that it is a collaborative, open, free space means that anyone can join in and contribute, and that it maintains a high level of rigor.
So, what is its impact? (It will sound a little cheesy, but...) It means we can be better people, we can know more, we can have more information. It means that anyone with access to the internet can benefit from its content and learn, without necessarily having to go through a paywall or register on a platform and hand over their data in order to reach the information.
Miguel Sevilla-Callejo: We call OpenStreetMap the Wikipedia of maps, because a large part of its philosophy is copied or cloned from the philosophy of Wikipedia. If you think of Wikipedia, what people do is write encyclopedic articles. What we do in OpenStreetMap is enter spatial data. We build a map collaboratively, which means that the openstreetmap.org page, where you can go to look at the maps, is just the tip of the iceberg. That is where OpenStreetMap is a little more diffuse and hidden, but most of the web pages, maps and spatial information you see on the Internet very likely comes from the data in the great free, open and collaborative database that is OpenStreetMap.
Many times you are reading a newspaper and you see a map and that spatial data is taken from OpenStreetMap. They are even used in agencies: in the European Union, for example, OpenStreetMap is being used. It is used in information from private companies, public administrations, individuals, etc. And, in addition, being free, it is constantly reused.
I always like to bring up projects that we have done here, in the city of Zaragoza. We have mapped the entire urban pedestrian network, that is, all the pavements, the zebra crossings, the areas where you can walk... and with this you can calculate how to move around the city on foot. You can't find this information about sidewalks, crosswalks and so on on a website, because it's not as lucrative as getting around by car, and you can take advantage of it, for example, as we did in some work I supervised at university, to understand how mobility differs for blind people, wheelchair users or people with a baby carriage.
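As an illustration of the kind of reuse Miguel describes, the sketch below uses the open-source osmnx library to download OpenStreetMap's pedestrian network for a city and compute a walking route. The coordinates are arbitrary examples, and this is not the code used in the projects he mentions.

```python
# Illustrative sketch: downloading OSM's walking network and computing a route.
import networkx as nx
import osmnx as ox

# Pedestrian network (sidewalks, crossings, paths) for a place resolved by OSM.
G = ox.graph_from_place("Zaragoza, Spain", network_type="walk")

# Hypothetical origin/destination coordinates (X = longitude, Y = latitude).
orig = ox.distance.nearest_nodes(G, X=-0.8891, Y=41.6488)
dest = ox.distance.nearest_nodes(G, X=-0.8773, Y=41.6563)

# Shortest walking route by edge length, using networkx on the OSM graph.
route = nx.shortest_path(G, orig, dest, weight="length")
print(f"Walking route crosses {len(route)} OSM nodes")
```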
-
You are telling us that these projects are open. If a citizen is listening to us right now and wants to participate in them, what should they do to participate? How can you be part of these communities?
Florencia Claes: The interesting thing about these communities is that you don't need to be formally associated or linked to them to be able to contribute. In Wikipedia you simply go to the Wikipedia page, create a username or not, and you can edit. What is the difference between creating a username or not? With a username you can keep better track of the contributions you have made, but you do not need to be a member of or registered with anything to edit Wikipedia.
There are groups at the local or regional level, related to the Wikimedia Foundation, that receive grants and funding to hold meetings or activities. That's good, because you meet people with the same concerns who are usually very enthusiastic about free knowledge. As my friends say, we are a bunch of geeks who have found each other and feel we have a group we belong to, in which we share and plan how to change the world.
Miguel Sevilla-Callejo: In OpenStreetMap it is practically the same; you can do it on your own. It is true that there is a slight difference with respect to Wikipedia. Apart from the openstreetmap.org page, we have all the documentation on a wiki, wiki.OpenStreetMap.org, which you can consult.
It is true that to edit in OpenStreetMap you do need a user account, to better track the changes people make to the map. If it were anonymous there could be more problems, because it is not like the texts in Wikipedia. But as Florencia said, it's much better if you join a community.
We have local groups in different places. One of the initiatives that we have recently reactivated is the OpenStreetMap Spain association, in which, as Florencia said, we are a group of people who like data and free tools, and there we share all our knowledge. A lot of people come up to us and say: "Hey, I just joined OpenStreetMap, I like this project, how can I do this? How can I do that?" And well, it's always much better to do it with other colleagues than to do it alone. But anyone can do it.
-
What challenges have you encountered when implementing these collaborative projects and ensuring their sustainability over time? What are the main challenges, both technical and social, that you face?
Miguel Sevilla-Callejo: One of the problems we find in all these movements that are so horizontal and in which we have to seek consensus to know where to move forward, is that in the end it is relatively problematic to deal with a very diverse community. There is always friction, different points of view... I think this is the most problematic thing. What happens is that, deep down, as we are all moved by enthusiasm for the project, we end up reaching agreements that make the project grow, as can be seen in Wikimedia and OpenStreetMap themselves, which continue to grow and grow.
From a technical point of view, for some specific things you need a certain level of computer skill, but the basics are very, very simple. For example, we have run mapathons, which consist of meeting up with computers and adding spatial information for areas where, for example, there has been a natural disaster. Basically, on a satellite image, people mark the little houses they see - little houses in the middle of the Sahel, for example - to help NGOs such as Doctors Without Borders. That's very easy: you open the browser, open OpenStreetMap and right away, with a few simple steps, you're able to edit and contribute.
It is true that, if you want to do things that are a little more complex, you need more computer skills. So we always adapt. There are people who enter data in a very professional way, including buildings and imports from the cadastre... and there are people like a girl here in Zaragoza who recently discovered the project and is entering the data she finds with an application on her mobile phone.
I do notice a certain gender bias in the project. Within OpenStreetMap that worries me a little, because it is true that the large majority of the people editing, and of the community, are men, and in the end that does mean that some data has a certain bias. But hey, we're working on it.
Florencia Claes: In that sense, the same thing happens to us in the Wikimedia environment. Worldwide, roughly 20% of the people participating in the project are women compared with 80% men, and that means that, in the case of Wikipedia, there sometimes seems to be a preference for articles about footballers. It is not really a preference; it is simply that the people who edit have those interests and, as most of them are men, we have more footballers and we lack articles related, for example, to women's health.
So we do face biases and we face that coordination of the community. Sometimes people with many years participate, new people... and achieving a balance is very important and very difficult. But the interesting thing is when we manage to keep in mind or remember that the project is above us, that we are building something, that we are giving something away, that we are participating in something very big. When we become aware of that again, the differences calm down and we focus again on the common good which, after all, I believe is the goal of these two projects, both in the Wikimedia environment and OpenStreetMap.
-
As you mentioned, both Wikimedia and OpenStreetMap are projects built by volunteers. How do you ensure data quality and accuracy?
Miguel Sevilla-Callejo: The interesting thing about all this is that the community is very large and there are many eyes watching. When there is a lack of rigor in the information, both in Wikipedia – which people know more about – but also in OpenStreetMap, alarm bells go off. We have tracking systems and it's relatively easy to see dysfunctions in the data. Then we can act quickly. This gives a capacity, in OpenStreetMap in particular, to react and update the data practically immediately and to solve those problems that may arise also quite quickly. It is true that there has to be a person attentive to that place or that area.
I've always liked to talk about OpenStreetMap data as a kind of beta map, borrowing the term from software: it has the very latest information, but there can be some minor errors. So, as a highly up-to-date, high-quality map it can be used for many things, but for others of course not, because we also have reference cartography built by the public administration.
Florencia Claes: In the Wikimedia environment we also work like this, because of the critical mass, because of the number of eyes looking at what we do and at what others do. Each person within this community takes on roles. There are formal roles, such as administrators or librarians, but there are others that people simply adopt: I like to patrol, so what I do is keep an eye on new articles, checking the articles published each day to see whether they need support or improvement or whether, on the contrary, they are so poor that they need to be taken out of the main space or deleted.
The key to these projects is the number of people who participate, and everything is voluntary, altruistic. The passion is very high, the level of commitment is very high. So people take great care of these things. Whether data is curated for upload to Wikidata or an article is written on Wikipedia, each person who does it does so with great affection and great care. Then, as time goes by, they keep an eye on the material they uploaded, to see how it continued to grow, whether it was used, whether it became richer or whether, on the contrary, something was deleted.
Miguel Sevilla-Callejo: Regarding data quality, I find interesting, for example, a recent initiative by the Territorial Information System of Navarre. They have migrated all their data for planning and guiding emergency routes to OpenStreetMap. They got involved in the project and improved the information, but building on what was already there [in OpenStreetMap], considering that it was of high quality and much more useful to them than other alternatives, which shows the quality and importance this project can have.
-
This data can also be used to generate open educational resources, along with other sources of knowledge. What do these resources consist of and what role do they play in the democratization of knowledge?
Florencia Claes: OER, open educational resources, should be the norm. Every teacher who generates content should make it available to citizens, and it should be built in modules from free resources. That would be ideal.
What role does the Wikimedia environment play in this? From hosting information that can be used when building resources, to providing spaces to do exercises or to take data and work with SPARQL, for example. In other words, there are different ways of approaching Wikimedia projects in relation to open educational resources. You can get involved and teach students how to identify data, how to verify sources, or simply how to do a critical reading of how information is presented and curated, and compare it across languages, for example.
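As a small illustration of the kind of exercise Florencia mentions, the query below retrieves data from Wikidata's public SPARQL endpoint using the Python SPARQLWrapper library; it is a generic example, not one of the classroom exercises themselves.

```python
# Illustrative sketch: a SPARQL query against Wikidata from Python.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery("""
SELECT ?country ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 ;        # instance of: country
           wdt:P1082 ?population .   # population
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 5
""")
endpoint.setReturnFormat(JSON)

# Print the five most populous countries according to Wikidata.
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["countryLabel"]["value"], row["population"]["value"])
```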
Miguel Sevilla-Callejo: OpenStreetMap is very similar. What's interesting and distinctive is the nature of the data. It's not exactly information in different formats, as in Wikimedia. Here the information is that free spatial database that is OpenStreetMap. So the only limit is your imagination.
I remember that there was a colleague who went to some conferences and made a cake with the OpenStreetMap map. He would feed it to the people and say, "See? These are maps that we have been able to eat because they are free." To make more serious or more informal or playful cartography, the limit is only your imagination. It happens exactly the same as with Wikipedia.
-
Finally, how can citizens and organizations be motivated to participate in the creation and maintenance of collaborative projects linked to free culture and open data?
Florencia Claes: I think we clearly have to do what Miguel said about the cake. You have to make a cake and invite people to eat cake. Speaking seriously about what we can do to motivate citizens to reuse this data, I believe, especially from personal experience and from the groups I have worked with on these platforms, that a friendly interface is a very important step.
In Wikipedia, the visual editor was activated in 2015. The visual editor brought many more women into editing Wikipedia. Before, editing was done only in code, which at first glance can seem hostile or distant, the sort of thing that makes you think "that's not for me". So, having interfaces where people don't need much prior knowledge to understand that this is a package containing certain data, that they can read it with a certain program or load it into a certain tool, making it simple, friendly, attractive... I think that will remove many barriers and put aside the idea that data is only for computer scientists. Data goes further than that; we can really take advantage of it in very different ways. So I think it's one of the barriers we should overcome.
Miguel Sevilla-Callejo: Something similar happened to us: until about 2015 (forgive me if that's not exactly the date), we had an interface that was quite horrible, almost like the code editing you have in Wikipedia, or worse, because you had to enter the data knowing the tagging scheme, etc. It was very complex. Now we have an editor where, basically, you're in OpenStreetMap, you hit edit and a super simple interface appears. You don't even have to enter tags in English anymore, it's all translated. Many things are pre-configured and people can enter data immediately and very simply. That has allowed many more people to join the project.
Another very interesting thing, which also happens with Wikipedia, although Wikipedia is much more focused on the web interface, is that an ecosystem of applications and services has grown up around OpenStreetMap. This has made it possible, for example, for mobile applications to appear that, in a very fast and simple way, let you add data directly in the field. And that makes it possible for people to enter data easily.
I wanted to stress it again, even though I know we keep coming back to the same point, because I think it is important: we forget it within the projects. We need people to be aware that the data is free, that it belongs to the community, that it is not in the hands of a private company, that it can be modified and transformed, that behind it there is a community of volunteers, and that none of this detracts from the quality of the data, which reaches everywhere. That way people come closer and don't see us as oddballs. Wikipedia is much more integrated into society's knowledge, and now with artificial intelligence even more so, but in OpenStreetMap people look at you as if to say "what are you talking about? I use a different application on my mobile", when they are actually using OpenStreetMap data without knowing it. So we need to get closer to society, so that people get to know us better.
Returning to the subject of the association, that is one of our objectives: for people to know us, to know that this data is open, that it can be transformed, that they can use it and that they are free to build with it, as I said before, whatever they want; the only limit is their imagination.
Florencia Claes: I think we should somehow integrate maps and data into the classroom through gamification, through games, as part of day-to-day schooling. I think we would gain a point in our favour there. Since we are within a free ecosystem, we can integrate visualisation or reuse tools into the data repositories' own pages, which I think would make everything much friendlier and would empower citizens in a way that encourages them to use these tools.
Miguel Sevilla-Callejo: It's interesting that there are things connecting both projects (those of us in OpenStreetMap and Wikipedia also sometimes forget this): there is data we can exchange, coordinate and combine. And that would also add to what you just said.
Subscribe to our Spotify profile to keep up to date with our podcasts
Open knowledge is knowledge that can be reused, shared and improved by other users and researchers without noticeable restrictions. This includes data, academic publications, software and other available resources. To explore this topic in more depth, we have representatives from two institutions whose aim is to promote scientific production and make it available in open access for reuse:
- Mireia Alcalá Ponce de León, Information Resources Technician of the Learning, Research and Open Science Area of the Consortium of University Services of Catalonia (CSUC).
- Juan Corrales Corrillero, Manager of the data repository of the Madroño Consortium.
Listen to the podcast here (in Spanish)
Summary of the interview
1. Can you briefly explain what the institutions you work for do?
Mireia Alcalá: The CSUC is the Consortium of University Services of Catalonia and is an organisation that aims to help universities and research centres located in Catalonia to improve their efficiency through collaborative projects. We are talking about some 12 universities and almost 50 research centres.
We offer services in many areas: scientific computing, e-government, repositories, cloud administration, etc., and we also offer library and open science services, which is what we are closest to. In the area of learning, research and open science, which is where I work, what we do is try to facilitate the adoption of new methodologies by the university and research system, especially in open science, and we support research data management.
Juan Corrales: The Consorcio Madroño is a consortium of university libraries of the Community of Madrid and the UNED (National University of Distance Education) for library cooperation. We seek to increase the scientific output of the universities that are part of the consortium and also to increase collaboration between the libraries in other areas. Like CSUC, we are also very involved in open science: in promoting it and in providing infrastructures that facilitate it, not only for the members of the Madroño Consortium but also globally. Apart from that, we also provide other library services and create structures for them.
2. What are the requirements for research to be considered open?
Juan Corrales: For research to be considered open there are many definitions, but perhaps one of the most important is given by the National Open Science Strategy, which has six pillars.
One of them is that both research data and publications, protocols, methodologies... must be made available in open access. In other words, everything must be accessible and, in principle, without barriers for everyone, not only for scientists or for universities that can pay for access to these research data or publications. It is also important to use open source platforms that we can customise. Open source software is software that anyone with the necessary knowledge can, in principle, modify, customise and redistribute, in contrast to the proprietary software of many companies, which does not allow any of this. Another important point, although it is still far from being achieved in most institutions, is allowing open peer review, because it lets us know who has done a review, with what comments, and so on. It could be said that it allows the peer review cycle to be redone and improved. A final point is citizen science: allowing ordinary citizens to be part of science, not only universities or research institutes.
And another important point is adding new ways of measuring the quality of science.
Mireia Alcalá: I agree with what Juan says. I would also add that, for a research process to be considered open, we have to look at it globally, that is, including the entire data lifecycle. We cannot talk about science being open if we only look at whether the data is open at the end. From the very beginning of the data lifecycle, it is important to use open platforms and to work in a more open and collaborative way.
3. Why is it important for universities and research centres to make their studies and data available to the public?
Mireia Alcalá: I think it is key that universities and centres share their studies, because a large part of research, both here in Spain and at European and world level, is funded with public money. Therefore, if society is paying for the research, it is only logical that it should also benefit from its results. In addition, opening up the research process can help make it more transparent and more accountable. Much of the research done to date has been found to be neither reusable nor reproducible. What does this mean? That in almost 80% of cases, someone else cannot take the data from a completed study and reuse it. Why? Because they don't follow the same standards, the same conventions, and so on. So I think we have to extend this everywhere, and a clear example is in times of pandemics. With COVID-19, researchers from all over the world worked together, sharing data and findings in real time, working in the same way, and science was seen to be much faster and more efficient.
Juan Corrales: The key points have already been touched upon by Mireia. Besides, it could be added that bringing science closer to society can make all citizens feel that science is something that belongs to us, not just to scientists or academics. It is something we can participate in and this can also help to perhaps stop hoaxes, fake news, to have a more exhaustive vision of the news that reaches us through social networks and to be able to filter out what may be real and what may be false.
4. What research should be published openly?
Juan Corrales: Right now, according to the law we have in Spain, the latest Science Law, all publications that are mainly financed by public funds or in which public institutions participate must be published in open access. This did not really have much impact until last year because, although the law came out two years ago, the previous law already said the same, and there is also a law of the Community of Madrid that says the same thing... but since last year it has been taken into account in the evaluation that ANECA (the quality evaluation agency) carries out on researchers. Since then, almost all researchers have made it a priority to publish their data and research openly. Publishing data, above all, was something that had not been done until now.
Mireia Alcalá: At the state level it is as Juan says. At the regional level we also have a law from 2022, a science law, which basically says exactly the same as the Spanish law. But I also like people to know that we have to take into account not only the legislation, but also the calls for proposals from which the money to fund the projects comes. In Europe, in framework programmes such as Horizon Europe, it is clearly stated that, if you receive funding from the European Commission, you will have to prepare a data management plan at the beginning of your research and publish the data following the FAIR principles.
5. Among other issues, both CSUC and Consorcio Madroño are in charge of supporting entities and researchers who want to make their data available to the public. What should a process of opening research data look like? What are the most common challenges and how do you solve them?
Mireia Alcalá: In our repository, which is called RDR (from Repositori de Dades de Recerca), it is basically the participating institutions that are in charge of supporting their research staff. Researchers often arrive at the repository when they are already in the final phase of the research and need to publish the data yesterday, and then everything is much more complex and time-consuming. It takes longer to verify the data and make it findable, accessible, interoperable and reusable.
In our particular case, we have a checklist that we require every dataset to comply with to ensure a minimum level of data quality, so that it can be reused. We are talking about having persistent identifiers such as ORCID for the researcher or ROR to identify the institutions, having documentation explaining how to reuse the data, having a licence, and so on. Because we have this checklist, researchers, as they deposit, improve their processes and start to improve the quality of their data from the beginning. It is a slow process. The main challenge, I think, is getting researchers to realise that what they have is data, because most of them don't know it. Most researchers think of data as numbers from a machine that measures air quality, and are unaware that data can be a photograph, footage from an archaeological excavation, a sound captured in a certain environment, and so on. Therefore, the main challenge is for everyone to understand what data is and that their data can be valuable to others.
And how do we solve it? Trying to do a lot of training, a lot of awareness raising. In recent years, the Consortium has worked to train data curation staff, who are dedicated to helping researchers directly refine this data. We are also starting to raise awareness directly with researchers so that they use the tools and understand this new paradigm of data management.
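By way of illustration only, the sketch below shows the kind of minimal, automated checks that a checklist like the one Mireia describes might translate into. The field names and rules are hypothetical and far simpler than the actual requirements of the RDR repository.

```python
# Purely illustrative sketch of minimal metadata checks before publication.
import re

REQUIRED_FIELDS = ["title", "description", "author_orcid", "publisher_ror", "license"]
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def check_dataset_metadata(metadata: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks pass."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not metadata.get(f)]
    orcid = metadata.get("author_orcid", "")
    if orcid and not ORCID_PATTERN.match(orcid):
        problems.append("author_orcid is not a valid ORCID identifier")
    if metadata.get("license") and "http" not in metadata["license"]:
        problems.append("license should be given as a resolvable URL")
    return problems

# Example usage with made-up metadata (the ROR value is a placeholder).
print(check_dataset_metadata({
    "title": "Excavation photographs, hypothetical site",
    "description": "Images and field notes from a 2023 campaign",
    "author_orcid": "0000-0002-1825-0097",
    "publisher_ror": "https://ror.org/placeholder",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}))
```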
Juan Corrales: In the Madroño Consortium, until November, the only way to open data was for researchers to pass a form with the data and its metadata to the librarians, and it was the librarians who uploaded it to ensure that it was FAIR. Since November, we also allow researchers to upload data directly to the repository, but it is not published until it has been reviewed by expert librarians, who verify that the data and metadata are of high quality. It is very important that the data is well described so that it can be easily found, reusable and identifiable.
As for the challenges, there are all those mentioned by Mireia - that researchers often do not know they have data - and also, although ANECA has helped a lot with the new obligations to publish research data, many researchers want to dump their data into the repositories without taking into account that it has to be quality data: it is not enough just to deposit it, it is important that the data can be reused later.
6. What activities and tools do you or similar institutions provide to help organisations succeed in this task?
Juan Corrales: From Consorcio Madroño, the repository itself that we use, the tool where the research data is uploaded, makes it easy to make the data FAIR, because it already provides unique identifiers, fairly comprehensive metadata templates that can be customised, and so on. We also have another tool that helps create the data management plans for researchers, so that before they create their research data, they start planning how they're going to work with it. This is very important and has been promoted by European institutions for a long time, as well as by the Science Act and the National Open Science Strategy.
Then, beyond the tools, review by expert librarians is also very important. There are other tools that help assess the quality of a dataset, of research data, such as FAIR EVA or F-UJI, but what we have found is that, in the end, what those tools mostly evaluate is the quality of the repository, of the software being used and of the requirements you place on researchers when they upload their metadata, because all our datasets get a fairly high and quite similar score. So what those tools really help us with is improving both the requirements we put on our datasets and the tools we have, in this case the Dataverse software, which is the one we are using.
Mireia Alcalá: At the level of tools and activities we are on a par, because we have had a relationship with the Madroño Consortium for years, and just like them we have all these tools that help and facilitate depositing the data in the best possible way right from the start, for example, the tool for making data management plans. Here at CSUC we have also been working very intensively in recent years to close the gaps in the data lifecycle, covering infrastructure, storage, cloud, etc., so that researchers also have a place to go when the data is being analysed and managed. After the repository, we move on to all the channels and portals that make it possible to disseminate and make all this science visible, because it doesn't make sense to build repositories that sit in a silo; they have to be interconnected. For many years now, a lot of work has been done on interoperability protocols and on following the same standards. Therefore, the data has to be available elsewhere too, and both Consorcio Madroño and we are present everywhere possible, and more.
7. Can you tell us a bit more about these repositories you offer? In addition to helping researchers to make their data available to the public, you also offer a space, a digital repository where this data can be housed, so that it can be located by users.
Mireia Alcalá: If we are talking specifically about research data, since we and Consorcio Madroño have the same repository, I will let Juan explain the software and its specifications, and I will focus on the other repositories of scientific output that CSUC also offers. What we do here is coordinate different cooperative repositories according to the type of resource they contain. So we have TDX for theses, RECERCAT for research papers, RACO for scientific journals and MACO for open access monographs. Depending on the type of product, we have a specific repository, because not everything can be in the same place, as each research output has its own particularities. Apart from the repositories, which are cooperative, we also have other spaces that we build for specific institutions, either with a more standard solution or with more customised functionalities. But basically that is it: for each type of research output, we have a specific repository adapted to the particularities of each format.
Juan Corrales: In the case of Consorcio Madroño, our repository is called e-scienceData, but it is based on the same software as the CSUC repository, which is Dataverse. It is open source software, so it can be improved and customised. Although the development is coordinated from Harvard University in the United States, institutions from all over the world participate in it; thirty-odd countries, if I am not mistaken, have already contributed to its development.
Among other things, for example, the translations into Catalan have been done by CSUC, the translation into Spanish has been done by Consorcio Madroño, and we have also contributed other small developments. The advantage of this software is that it makes it much easier for the data to be FAIR and compatible with other access points that have much more visibility: CSUC, for example, is much larger, but the Madroño Consortium has six universities, and it is rare for someone to look for a dataset directly in the Madroño Consortium's e-scienceData. They usually search via Google or a European or international portal. With the facilities Dataverse offers, they can search from anywhere and end up finding the data we hold at Consorcio Madroño or at CSUC.
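As an illustration of how Dataverse makes deposited data findable from anywhere, the sketch below queries a Dataverse installation's public Search API with Python. The base URL is a placeholder to be replaced with a real installation, and the fields printed are a small subset of what the API returns.

```python
# Illustrative sketch: querying the Dataverse Search API for datasets.
import requests

BASE_URL = "https://dataverse.example.org"  # placeholder, not a real endpoint

resp = requests.get(
    f"{BASE_URL}/api/search",
    params={"q": "air quality", "type": "dataset", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

# Each item includes, among other fields, a name and a persistent identifier.
for item in resp.json()["data"]["items"]:
    print(item.get("name"), "-", item.get("global_id"))
```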
8. What other platforms with open research data, at Spanish or European level, do you recommend?
Juan Corrales: For example, at the Spanish level there is FECYT, the Spanish Foundation for Science and Technology, which has a harvesting platform that collects the research data of practically all Spanish institutions. The publications of all these institutions appear there: Consorcio Madroño, CSUC and many more.
Then, specifically for research data, there is a lot of research that should be put in a thematic repository, because that's where researchers in that branch of science are going to look. We have a tool to help choose the thematic repository. At the European level there is Zenodo, which has a lot of visibility, but does not have the data quality support of CSUC or the Madroño Consortium. And that is something that is very noticeable in terms of reuse afterwards.
Mireia Alcalá: At the national level, apart from Consorcio Madroño's initiatives and our own, data repositories are not yet widespread. We are aware of some initiatives under development, but it is still too early to see their results. However, I do know of some universities that have adapted their institutional repositories so that they can also hold data. And while this is a valid solution for those who have no other choice, it has been found that repository software not designed to handle the particularities of data - its heterogeneity, formats, diversity, large size, etc. - falls somewhat short. Then, as Juan said, at the European level, Zenodo is the established multidisciplinary, multi-format repository, born out of a project of the European Commission. I agree with him that, as it is a self-archiving and self-publishing repository - that is, I, Mireia Alcalá, can go there in five minutes, deposit any document I have, nobody checks it, I fill in the minimum metadata they ask for and I publish it - the quality is clearly very variable. There are some things that are really usable and perfect, but there are others that need a little more care. As Juan said, at the disciplinary level it is also important to highlight that, in all those areas that have a disciplinary repository, researchers should go there, because that is where they will be able to use the most appropriate metadata, where everybody works in the same way and where everybody knows where to look for that data. For anyone who is interested, there is a directory called re3data, which is basically a directory of all these multidisciplinary and disciplinary repositories. It is a good place for anyone who does not know what exists in their discipline; it is a good resource to consult.
9. What actions do you consider to be priorities for public institutions in order to promote open knowledge?
Mireia Alcalá: What I would basically say is that public institutions should focus on establishing clear policies on open science, because it is true that we have come a long way in recent years, but there are times when researchers are a bit bewildered. And apart from policies, it is above all about offering incentives to the entire research community, because many people are making the effort to change their way of working to immerse themselves in open science, and sometimes they don't see how all that extra effort pays off. So I would say this: policies and incentives.
Juan Corrales: From my point of view, the policies that we already have on paper, at the national and regional level, are usually quite correct, quite good. The problem is that often no attempt has been made to enforce them. Until recently, as we have seen for example with ANECA - which has promoted the use of data repositories and research article repositories - they had not really started to be used on a massive scale. In other words, incentives are necessary, and not just obligations. As Mireia has also said, we have to convince researchers to see open publishing as their own, as something that benefits both them and society as a whole. What I think is most important is exactly that: the awareness of researchers.
Did you know that data science skills are among the most in-demand skills in business? In this podcast, we tell you how you can train yourself in this field in a self-taught way. To do so, we are joined by two data science experts:
- Juan Benavente, industrial and computer engineer with more than 12 years of experience in technological innovation and digital transformation. He has also been training new professionals in technology schools, business schools and universities for years.
- Alejandro Alija, PhD in physics, data scientist and expert in digital transformation. In addition to his extensive professional experience focused on the Internet of Things, Alejandro also works as a lecturer in different business schools and universities.
Listen to the podcast (in Spanish)
Summary of the interview
-
What is data science? Why is it important and what can it do for us?
Alejandro Alija: Data science could be defined as a discipline whose main objective is to understand the world, the processes of business and life, by analysing and observing data. In the last 20 years it has gained exceptional relevance due to the explosion in data generation, driven mainly by the emergence of the internet and the connected world.
Juan Benavente: The term data science has evolved since its inception. Today, a data scientist is the person who is working at the highest level in data analysis, often associated with the building of machine learning or artificial intelligence algorithms for specific companies or sectors, such as predicting or optimising manufacturing in a plant.
The profession is evolving rapidly, and is likely to fragment in the coming years. We have seen the emergence of new roles such as data engineers or MLOps specialists. The important thing is that today any professional, regardless of their field, needs to work with data. There is no doubt that any position or company requires increasingly advanced data analysis. It doesn't matter if you are in marketing, sales, operations or at university. Anyone today is working with, manipulating and analysing data. If we also aspire to data science, which would be the highest level of expertise, we will be in a very beneficial position. But I would definitely recommend any professional to keep this on their radar.
-
How did you get started in data science and what do you do to keep up to date? What strategies would you recommend for both beginners and more experienced profiles?
Alejandro Alija: My basic background is in physics, and I did my PhD in basic science. In fact, it could be said that any scientist is, by definition, a data scientist, because science is based on formulating hypotheses and proving them with experiments and theories. My relationship with data started early, in academia. A turning point in my career was when I started working in the private sector, specifically in an environmental management company that measures and monitors air pollution. The environment is a field that has traditionally been a major generator of data, especially as it is a regulated sector where administrations and private companies are obliged, for example, to record air pollution levels under certain conditions. I found historical series going back up to 20 years that were available for me to analyse. From there my curiosity began, and I specialised in specific tools to analyse and understand what is happening in the world.
Juan Benavente: I can identify with what Alejandro said, because I am not a computer scientist either. I trained in industrial engineering and, although computer science is one of my interests, it was not my base. Nowadays, in contrast, I do see more specialists being trained at university level. A data scientist today carries many skills, such as statistics, mathematics and the ability to understand everything that goes on in the industry. I have been acquiring this knowledge through practice. As for keeping up to date, I think that, in many cases, you can stay in contact with companies that are innovating in this field. A lot can also be learned at industry or technology events. I started in the smart cities field and have moved on to the industrial world, learning little by little.
Alejandro Alija: I would add another source for keeping up to date. Apart from what Juan has said, I think it's important to keep an eye on the technology manufacturers and the market players. They are a very useful source of information for staying current: identify their future strategies and what they are betting on.
-
If someone with little or no technical knowledge wants to learn data science, where do they start?
Juan Benavente: In training, I have come across very different profiles: from people who have just graduated from university to people trained in very different fields who find in data science an opportunity to reinvent themselves and dedicate themselves to this. Thinking of someone who is just starting out, I think the best thing to do is put your knowledge into practice. In projects I have worked on, we defined the methodology in three phases: a first phase on more theoretical aspects, covering mathematics, programming and everything a data scientist needs to know; once you have those basics, the sooner you start working and practising those skills, the better. I believe that practice sharpens the wits and, both to keep up to date and to train yourself and acquire useful knowledge, the sooner you get into a project, the better. Even more so in a field that is updated so frequently. In recent years, the emergence of generative AI has brought other opportunities, including for new profiles who want to train. Even if you are not an expert in programming, there are tools that can help you with it, and the same goes for mathematics or statistics.
Alejandro Alija: To complement what Juan says from a different perspective, I think it is worth highlighting the evolution of the data science profession. I remember when that article about "the sexiest profession in the world" became famous and went viral, but then things settled down. The first settlers in the world of data science did not come so much from computer science or informatics. There were more outsiders: physicists and mathematicians, with a strong background in mathematics and physics, and even some engineers whose work and professional development meant that they ended up using many tools from the computer science field. Gradually, it has become more balanced. It is now a discipline that continues to have those two strands: people who come from the world of physics and mathematics towards the data itself, and people who come with programming skills. Everyone knows what they have to balance in their toolbox. Thinking about a junior profile who is just starting out, I think one very important thing - and we see this when we teach - is programming skills. I would say that having programming skills is not just a plus but a basic requirement for advancing in this profession. It is true that some people can do well without many programming skills, but I would argue that a beginner needs those first programming skills with a basic toolset. We're talking about languages such as Python and R, which are the headline languages. You don't need to be a great coder, but you do need some basic knowledge to get started. Then, of course, specific training in the mathematical foundations of data science is crucial. Fundamental statistics and more advanced statistics are complements that, if present, will move a person along the data science learning curve much faster. Thirdly, I would say that specialisation in particular tools is important. Some people are more oriented towards data engineering, others towards the modelling world. Ideally, specialise in a few frameworks and use them together, as effectively as possible.
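As a minimal illustration of those first programming skills, a beginner might start with something as simple as loading a small table and computing basic descriptive statistics with pandas; the figures below are invented for the example.

```python
# A first, very basic data-analysis exercise with pandas.
import pandas as pd

# Hypothetical air-quality measurements, just to have something to describe.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "no2_ugm3": [41.0, 38.5, 22.3, 25.1],
})

print(df.describe())                              # mean, std, quartiles
print(df.groupby("station")["no2_ugm3"].mean())   # a first aggregation
```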
-
In addition to teaching, you both work in technology companies. What technical certifications are most valued in the business sector and what open sources of knowledge do you recommend to prepare for them?
Juan Benavente: Personally, it's not what I look at most, but I think certifications can be relevant, especially for people who are starting out and need help structuring their approach to the problem and understanding it. I recommend certifications in technologies that are in use in any company where you might want to end up working, especially from providers of cloud computing and widely used data analytics tools. These are certifications I would recommend for someone who wants to approach this world and needs a structure to help them. When you don't have a knowledge base, it can be a bit confusing to understand where to start. Perhaps you should reinforce programming or mathematical knowledge first, but it can all seem a bit complicated. Where these certifications certainly help, in addition to reinforcing concepts, is in ensuring that you are progressing well and getting to know the typical ecosystem of tools you will be working with tomorrow. It is not just about theoretical concepts, but about knowing the ecosystems you will encounter when you start working, whether you are starting your own company or joining an established one. It makes it much easier to get to know the typical tool ecosystem, whether that is cloud computing from Microsoft, Amazon or other providers of such solutions. This will allow you to focus more quickly on the work itself and less on all the tools that surround it. I believe this type of certification is useful, especially for profiles approaching this world with enthusiasm. It will help them both to structure themselves and to land well in their professional destination. They are also likely to be valued in selection processes.
Alejandro Alija: If someone listens to us and wants more specific guidelines, it could be structured in blocks. There are a series of massive online courses that, for me, were a turning point. In my early days, I tried to enrol in several of these courses on platforms such as Coursera, edX, where even the technology manufacturers themselves design these courses. I believe that this kind of massive, self-service, online courses provide a good starting base. A second block would be the courses and certifications of the big technology providers, such as Microsoft, Amazon Web Services, Google and other platforms that are benchmarks in the world of data. These companies have the advantage that their learning paths are very well structured, which facilitates professional growth within their own ecosystems. Certifications from different suppliers can be combined. For a person who wants to go into this field, the path ranges from the simplest to the most advanced certifications, such as being a data solutions architect or a specialist in a specific data analytics service or product. These two learning blocks are available on the internet, most of them are open and free or close to free. Beyond knowledge, what is valued is certification, especially in companies looking for these professional profiles.
-
In addition to theoretical training, practice is key, and one of the most interesting ways of learning is to replicate exercises step by step. In this sense, from datos.gob.es we offer didactic resources, many of them developed by you as experts in the project. Can you tell us what these exercises consist of? How are they approached?
Alejandro Alija: The approach we always took was designed for a broad audience, without complex prerequisites. We wanted any user of the portal to be able to replicate the exercises, although it is clear that the more knowledge you have, the more you can get out of them. The exercises have a well-defined structure: a documentary section, usually a post or a report describing what the exercise consists of, what materials are needed, what the objectives are and what it is intended to achieve. In addition, we accompany each exercise with two additional resources. The first is a code repository where we upload the necessary materials, with a brief description and the code of the exercise. It can be a Python notebook, a Jupyter notebook or a simple script, containing the technical content. The other fundamental element, which we believe is important, is aimed at making the exercises easier to run. In data science and programming, non-specialist users often find it difficult to set up a working environment. A Python exercise, for example, requires having a programming environment installed, knowing the necessary libraries and making configurations that are trivial for professionals but can be very complex for beginners. To mitigate this barrier, we publish most of our exercises on Google Colab, a wonderful and open tool. Google Colab is a web programming environment where the user only needs a browser. Basically, Google provides a virtual computer where we can run our programmes and exercises without the need for special configurations. The important thing is that each exercise is ready to use, and we always check it in this environment, which makes learning much easier for beginners or less technically experienced users.
Juan Benavente: Yes, we always take a user-oriented approach, step by step, trying to make it open and accessible. The aim is for anyone to be able to run an exercise without the need for complex configurations, focusing on topics as close to reality as possible. We often take advantage of open data published by entities such as the DGT or other bodies to carry out realistic analyses. We have developed very interesting exercises, such as energy market predictions or analyses of critical materials for batteries and electronics, which allow you to learn not only about the technology but also about the specific subject matter. You can get down to work right away, both to learn and to find out about the subject.
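A typical exercise of the kind described starts along these lines: a short, Colab-friendly script that reads an openly published CSV file and takes a first look at it. The URL below is a placeholder, not one of the actual datos.gob.es exercises.

```python
# Minimal, Colab-friendly sketch of a first step in a data exercise.
import pandas as pd

DATA_URL = "https://example.org/open-data/traffic.csv"  # placeholder open-data URL

df = pd.read_csv(DATA_URL)
print(df.shape)    # number of rows and columns
print(df.head())   # first records
print(df.dtypes)   # column types, a first data-quality check
```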
-
In closing, we'd like you to offer a piece of advice that is more attitude-oriented than technical, what would you say to someone starting out in data science?
Alejandro Alija: As an attitude tip for someone starting out in data science, I suggest being brave. There is no need to worry about being unprepared, because in this field everything is still to be done and anyone can contribute value. Data science is multi-faceted: there are professionals closer to the business world who can provide valuable insights, and others who are more technical and need to understand the context of each area. My advice is to work with the resources available without panicking because, although the path may seem complex, the opportunities are great. As a technical tip, it is important to be sensitive to how data is produced and used. The more understanding one has of this world, the smoother the approach to projects will be.
Juan Benavente: I endorse the advice to be brave and add a reflection on programming: many people find the theoretical concept attractive, but when they get to practice and see the complexity of programming, some are discouraged by lack of prior knowledge or different expectations. It is important to add the concepts of patience and perseverance. When you start in this field, you are faced with multiple areas that you need to master: programming, statistics, mathematics, and specific knowledge of the sector you will be working in, be it marketing, logistics or another field. The expectation of becoming an expert quickly is unrealistic. It is a profession that, although it can be started without fear and by collaborating with professionals, requires a journey and a learning process. You have to be consistent and patient, managing expectations appropriately. Most people who have been in this world for a long time agree that they have no regrets about going into data science. It is a very attractive profession where you can add significant value, with an important technological component. However, the path is not always straightforward. There will be complex projects, moments of frustration when analyses do not yield the expected results or when working with data proves more challenging than expected. But looking back, few professionals regret having invested time and effort in training and developing in this field. In summary, the key tips are: courage to start, perseverance in learning and development of programming skills.