Data science has become a pillar of evidence-based decision-making in the public and private sectors. In this context, there is a need for a practical and universal guide that transcends technological fads and provides solid and applicable principles. This guide offers a decalogue of good practices that accompanies the data scientist throughout the entire life cycle of a project, from the conceptualization of the problem to the ethical evaluation of the impact.
- Understand the problem before looking at the data. The initial key is to clearly define the context, objectives, constraints, and indicators of success. A solid framing prevents later errors.
- Know the data in depth. Beyond the variables, it involves analyzing their origin, traceability and possible biases. Data auditing is essential to ensure representativeness and reliability.
- Ensure quality. Without clean data there is no science. EDA techniques, imputation, normalization and the monitoring of quality metrics make it possible to build solid and reproducible foundations.
- Document and version. Reproducibility is a scientific condition. Notebooks, pipelines, version control, and MLOps practices ensure traceability and replicability of processes and models.
- Choose the right model. Sophistication does not always win: the decision must balance performance, interpretability, costs and operational constraints.
- Measure meaningfully. Metrics should align with goals. Cross-validation, data drift control and rigorous separation of training, validation and test data are essential to ensure generalization.
- Visualize to communicate. Visualization is not an ornament, but a language to understand and persuade. Data-driven storytelling and clear design are critical tools for connecting with diverse audiences.
- Work as a team. Data science is collaborative: it requires data engineers, domain experts, and business leaders. The data scientist must act as a facilitator and translator between the technical and the strategic.
- Stay up-to-date (and critical). The ecosystem is constantly evolving. It is necessary to combine continuous learning with selective criteria, prioritizing solid foundations over passing fads.
- Be ethical. Models have a real impact. It is essential to assess bias, protect privacy, ensure explainability and anticipate misuse. Ethics is a compass and a condition of legitimacy.
Finally, the report includes a bonus track on Python and R, highlighting that both languages are complementary allies: Python dominates in production and deployment, while R offers statistical rigor and advanced visualization. Knowing both multiplies the versatility of the data scientist.
The Data Scientist's Decalogue is a practical, timeless and cross-cutting guide that helps professionals and organizations turn data into informed, reliable and responsible decisions. Its objective is to strengthen technical quality, collaboration and ethics in a discipline in full expansion and with great social impact.
Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation. The contents and points of view reflected in this publication are the sole responsibility of the author.
Spain's open data initiative, datos.gob.es, is revamped to offer a more accessible, intuitive and efficient experience. The change responds to the desire to improve access to data and facilitate its use by citizens, researchers, companies and administrations. With an updated design and new functionalities, the platform will continue to act as a meeting point for all those who seek to innovate based on data.
Focus on high-value datasets and web services
The new website reinforces its central axis, the National Open Data Catalogue, an access point to nearly 100,000 datasets, grouping more than 500,000 files, which the Spanish Public Administration makes available to companies, researchers and citizens for reuse. It contains datasets published by bodies of the General State Administration, regional and local administrations, universities, etc.
One of the most relevant advances is the improvement in the possibilities for data publishers to describe in a more precise and structured way the data collections they wish to make available to the public. A more detailed description of the sources makes it easier for users to locate data of interest.
Specifically, the platform incorporates a new metadata model aligned with the latest versions of European standards, the national application profile DCAT-AP-ES, which adapts guidelines from the European metadata exchange scheme DCAT-AP (Data Catalog Vocabulary – Application Profile). This profile improves interoperability at national and European level, facilitates compliance with EU regulations, favors the federation of catalogues and the localization of datasets, and contributes to improving the quality of metadata through validation mechanisms, among other advantages.
In addition, the new version of datos.gob.es introduces significant improvements to the Catalog view, highlighting high-value data (HVD) and data offered through web services. To improve their identification, distinctive symbols have been added that allow you to differentiate the types of resources immediately.

Likewise, the number of documented metadata fields has been expanded, and they are shown to users through a clearer structure. Metadata provided by publishers can now be categorized into general information, data sheet, contact and quality aspects. This new organization provides users with a more complete and accessible view of each dataset.

It is also worth noting that the data request process has been optimized to offer a more intuitive and fluid experience.
A new information architecture to improve usability
The new datos.gob.es platform has also adapted its information architecture to make it more intuitive and improve navigation and access to relevant information. The new structure makes it easier to locate datasets and editorial content, while contributing to accessibility, ensuring that all users, regardless of their technical knowledge or device type, can interact with the website without difficulty.
Among other issues, the menu has been simplified, grouping the information into five large sections:
- Data: includes access to the National Catalogue, along with forms to request new data to be published as open. Information on data spaces and safe environments can also be found in this section, along with a section on resources for publisher support.
- Community: designed to learn more about open data initiatives in Spain and be inspired by examples of reuse through various use cases, organized into companies and applications. It should be noted that the map of initiatives has been updated with revised and improved files, with the option of filtering by the category of data offered, making it easier to consult. In this section we also find information on the challenges and the subsection of sectors, which has been considerably expanded, incorporating all those defined by the Technical Standard for Interoperability of Reuse of Information Resources, which allows a more complete view of both the data and its potential for use according to each area.
- News: users will be able to keep up to date with the latest developments in the data ecosystem through news and information on events related to the subject.
- Knowledge: one of the main novelties of the new platform is that all the resources that seek to promote data-based innovation have been unified under a single heading, making it easier to organize. Through this section, users will be able to access: blog articles, written by experts in various fields (data science, data governance, legal aspects, etc.), where trends in the sector are explained and analyzed; data exercises to learn step by step how to process and work with data; infographics that graphically summarize complex use cases or concepts; interviews with experts in podcast, video or written formats; and guides and reports, aimed at both publishers and reusers of data. Also included is the link to the GitHub repository, whose visibility has been strengthened in order to promote access and collaboration of the data community in the development of open tools and resources.
- About us: in addition to information about the project, FAQs, contact, platform technology, etc., in this section you can access the new dashboard, which now provides more detailed metrics on the catalog, content, and outreach actions.
The new version of datos.gob.es also introduces key improvements to the way content and datasets are located. The platform has been optimized with an intelligent search, which allows a guided search and a greater number of filters, making it easier to find information faster and more accurately.
Improved internal functionalities
The new version of datos.gob.es also brings with it internal improvements that will facilitate management for data publishers, optimizing processes. The private part accessed by agencies has been revamped to offer a more intuitive and functional interface. The console has been redesigned to streamline data management and administration, allowing for more efficient and structured control.
In addition, the content manager has been updated to its latest version, which guarantees better performance.
These enhancements reinforce datos.gob.es's commitment to the continuous evolution and optimization of its platform, ensuring a more accessible and efficient environment for all actors involved in the publication and management of open data. The new platform not only improves the user experience, but also drives data reuse across multiple industries.
We invite you to explore what's new and reap the benefits of data as a driver of innovation!
Once again, the Junta de Castilla y León has launched its open data contest to reward the innovative use of public information.
In this post, we summarize the details to participate in the IX edition of this event, which is an opportunity for both professionals and students, creative people or multidisciplinary teams who wish to give visibility to their talent through the reuse of public data.
What does the competition consist of?
The aim of the competition is to recognize projects that use open datasets from the Junta de Castilla y León. These datasets can be combined, if the participants wish, with other public or private sources, at any level of administration.
Projects can be submitted in four categories:
- Ideas category: aimed at people or teams who want to submit a proposal to create a service, study, application, website or any other type of development. The project does not need to be completed; the important thing is that the idea is original, viable and has a potential positive impact.
- Products and services category: designed for projects already developed and accessible to citizens, such as online services, mobile applications or websites. All developments must be available via a public URL. This category includes a specific award for students enrolled in official education during the 2024/2025 or 2025/2026 school years.
- Didactic resource category: aimed at educational projects that use open data as a support tool in the classroom. The aim is to promote innovative teaching through Creative Commons licensed resources, which can be shared and reused by teachers and students.
- Data journalism category: it will reward journalistic works published or substantially updated, in written or audiovisual format, that make use of open data to inform, contextualize or analyze topics of interest to citizens. The journalistic pieces must have been published in print or digital media since September 24, 2024, the day after the deadline for submitting candidacies to the previous call for awards.
In all categories, it is essential that at least one dataset from the open data portal of the Junta de Castilla y León is used. This platform has hundreds of datasets on different sectors such as the environment, economy, society, public administration, culture, education, etc. that can be used as a basis to develop useful, informative and transformative ideas.
Who can participate?
The competition is open to any natural or legal person, who may apply individually or as a group. In addition, you can submit more than one application, even for different categories. Although the same project may not receive more than one award, this flexibility allows the same idea to be explored from different approaches: educational, journalistic, technical or conceptual.
What prizes are awarded?
The 2025 edition of the contest includes prizes with a financial endowment, an accrediting diploma and institutional dissemination through the open data portal and other communication channels of the Junta.
The distribution and amount of the prizes by category is:
- Ideas category
  - First prize: €1,500
  - Second prize: €500
- Products and services category
  - First prize: €2,500
  - Second prize: €1,500
  - Third prize: €500
  - Special Student Prize: €1,500
- Didactic resource category
  - First prize: €1,500
- Data journalism category
  - First prize: €1,500
  - Second prize: €1,000
Under what criteria are the prizes awarded?
The jury will assess the candidatures considering different evaluation criteria, as set out in the rules and the order of call, including their originality, social utility, technical quality, feasibility, impact, economic value and degree of innovation.
How to participate?
As in other editions, candidacies can be submitted in two ways:
- In person, at the General Registry of the Ministry of the Presidency, at the registry assistance offices of the Junta de Castilla y León or at the places established in article 16.4 of Law 39/2015.
- Electronically, through the electronic office of the Junta de Castilla y León.
Each application must include:
- Identification data of the author(s).
- Title of the project.
- Category or categories to which it is submitted.
- An explanatory report of the project, with a maximum length of 1,000 words, providing all the information that can be assessed by the jury according to the established scale.
- For applications to the Products and Services category, the URL to access the project.
The deadline to submit proposals is September 22, 2025.
With this contest, the Junta de Castilla y León reaffirms its commitment to the open data policy and the culture of reuse. The competition not only recognizes the creativity, innovation and usefulness of the projects presented, but also contributes to disseminating the transformative potential of open data in areas such as education, journalism, technology or social entrepreneurship.
In previous editions, solutions to improve mobility, interactive maps on forest fires, tools for the analysis of public expenditure or educational resources on the rural environment, among many other examples, have been awarded. You can read more about last year's winning proposals and others on our website. In addition, all these projects can be consulted in the history of winners available on the community's open data portal.
We encourage you to participate in the contest and get the most out of open data in Castilla y León!
Madrid City Council has launched an initiative to demonstrate the potential of open data: the first edition of the Open Data Reuse Awards 2025. With a total budget of 15,000 euros, this competition seeks to promote the reuse of the data shared by the council on its open data portal, demonstrating that they can be a driver of social innovation and citizen participation.
The challenge is clear: to turn data into useful, original and impactful ideas. If you think you can do it, we summarize below the information you need to compete.
Who can participate?
The competition is open to practically everyone: from individuals to companies or groups of any kind. The only condition is to submit a project carried out between September 10, 2022 and September 9, 2025 and that uses at least one dataset from the Madrid City Council's open data portal as a base. Data from other public and private sources can also be used, as long as the Madrid City Council datasets are a key part of the project.
Of course, projects that have already been awarded, contracted or financed by the City Council itself are not accepted, nor are works submitted after the deadline or without the required documentation.
What projects can be submitted?
There are four main areas in which you can participate:
- Web services and applications: refers to projects that provide services, studies, web applications, or mobile apps.
- Studies, research and ideas: refers to projects of exploration, analysis or description of ideas aimed at the creation of services, studies, visualizations, web applications or mobile apps. Bachelor's and master's degree final university projects can also participate in this category.
- Proposals to improve the quality of the open data portal: includes projects, services, applications or initiatives that contribute to boosting the quality of the datasets published on the Madrid City Council's open data portal.
- Data visualizations: you can participate in this category with various content, such as maps, graphs, tables, 3D models, digital art, web applications and animations. Representations can be static, such as infographics, posters, or figures in publications, or dynamic, including videos, interactive dashboards, and stories.
What are the prizes?
For each category, two prizes with different cash amounts are awarded:
| Category | First prize | Second prize |
| --- | --- | --- |
| Web services and applications | €3,000 | €1,500 |
| Proposals to improve the quality of the open data portal | €3,000 | €1,500 |
| Studies, research and ideas | €2,000 | €1,000 |
| Data visualizations | €2,000 | €1,000 |
Figure 1. Prize money for the first edition of the 2025 Open Data Reuse Awards. Source: Madrid City Council.
Beyond the economic prize, this call is a great opportunity to give visibility to ideas that take advantage of the transparency and potential of open data. In addition, if the proposal improves public services, solves a real problem or helps to better understand the city, it will have great value that goes far beyond recognition.
How are projects valued?
A jury will evaluate each project, assigning a score of up to 50 points that takes into account aspects such as originality, social benefit, technical quality, accessibility, ease of use, or even design, in the case of visualizations. If deemed necessary, the jury may request further information from the participants.
The two projects with the highest score will win, although to be considered, the proposals must reach at least 25 points out of a possible 50. If none of them meets this requirement, the category will be declared void.
The jury will be made up of representatives from different areas of the City Council, with experience in innovation, transparency, technology and data. A representative of ASEDIE (Multisectoral Association of Information), the association that promotes the reuse and distribution of information in Spain, will also participate.
How do I participate?
The deadline to register is September 9, 2025 at 11:59 p.m. Natural persons can submit the application:
- Online through the City Council's Electronic Office. This procedure requires identification and electronic signature.
- In person at municipal service offices.
Legal persons may only submit their candidacy electronically.
In any case, the official form must be completed and accompanied by a report explaining the project, its operation, its benefits, the use of the data, and if possible, including screenshots, links or prototypes.
You can see the complete rules here.
More than 90,000 people from all over the world participated in the latest edition of the Space Apps Challenge. This annual two-day event, organized by the US space agency, NASA, is an opportunity to innovate and learn about the advantages that open space data can offer.
This year the competition will be held on October 4 and 5. Through a hackathon, participants will engage first-hand with NASA's most relevant missions and research. It's an opportunity to learn how to launch and lead projects through hands-on use of NASA data in the real world. In addition, it is a free activity open to anyone (those under 18 years of age must be accompanied by a legal guardian).
In this post, we tell you some of the keys you need to know about this global benchmark event.
Where is it held?
Under the banner of the Space Apps Challenge, virtual and face-to-face events take place all over the world. Specifically, in Spain, meetings are held in several cities:
- Barcelona
  - Where: in person, at 42 Barcelona (Carrer D'Albert Einstein 11).
- Madrid
  - Where: in person, at the School of Digital Competences – San Blas Digital (Calle Amposta, 34).
- Murcia
  - Where: in person, at UCAM HITECH (Av. Andrés Hernandez Ros, 1, Guadalupe).
- Malaga
  - Where: in person, at a location to be determined (you can contact the event organizer through the link).
- Pamplona
  - Where: in person and virtual, at a location to be determined (you can contact the event organizer through the link).
- San Vicente del Raspeig (Alicante)
  - Where: in person, at the Alicante Science Park (University of Alicante, San Vicente del Raspeig).
- Seville
  - Where: in person, at a location to be determined (you can contact the event organizer through the link).
- Valencia
  - Where: in person, at the UPV Student House, Polytechnic University of Valencia (Camino de Vera, s/n, Building 4K).
- Zaragoza
  - Where: in person, at the Betancourt Building, Río Ebro Campus (EINA), Calle María de Luna, 1.
All of them will hold a welcome ceremony on Friday, October 3 at 5:30 p.m., at which the details of the competition will be presented and the teams and the themes of each challenge will be organized.
To participate in any of the events, you can register individually and the organization will help you find a team. You can also register your team directly (up to six people).
If you can't find any in-person events near you, you can sign up for the universal event that will be online.
Are there any prizes?
Yes! Each event will award its own prizes. In addition, each year NASA grants ten global awards in different categories:
- Best Use of Science Award: recognizes the project that makes the most valid and outstanding use of science and/or the scientific method.
- Best Data Use Award: awarded to the project that makes space data more accessible or uses it in a unique way.
- Best Use of Technology Award: distinguishes the project that represents the most innovative use of technology.
- Galactic Impact Award: awarded to the project with the greatest potential to improve life on Earth or in the universe.
- Best Mission Concept Award: recognizes the project with the most plausible concept and design.
- Most Inspiring Award: awarded to the project that manages to move and inspire the public.
- Best Narrative Award: highlights the project that most creatively communicates the potential of open data through the art of storytelling.
- Global Connection Award: awarded to the project that best connects people around the world through technology.
- Art and Technology Award: recognizes the project that most effectively combines technical and creative skills.
- Local Impact Award: awarded to the project that demonstrates the greatest potential to generate impact at the local level.
Figure 1. Space Apps Challenge Awards. Source: https://www.spaceappschallenge.org/brand/
From Gijón to the world: the Spanish project awarded in 2024
In last year's edition, a Spanish project, specifically from Gijón, won the global award for best mission concept with its Landsat Connect application proposal. The AsturExplorer team developed a web application designed to provide a fast, simple and intuitive way to track the path of Landsat satellites and access surface reflectance data. Their project fostered interdisciplinary and scientific learning capacities, and empowered citizens.
The Landsat program consists of a series of Earth observation satellite missions, jointly managed by NASA and the U.S. Geological Survey (USGS), providing images and data about our planet since 1972.
End users of the app developed by AsturExplorer can set a destination location and receive notifications in advance to know when the Landsat satellite will pass over each area. This allows users to prepare and take their own measurements on the ground and obtain pixel data without the need to constantly monitor satellite schedules.
The AsturExplorer team used open Landsat data from NASA and Earth Explorer. They also made use of artificial intelligence to understand the technical problem and compare multiple alternatives. You can read more about this use case here.
How do I register?
The Space Apps Challenge website offers a section of frequently asked questions and a video tutorial to facilitate registration. The process is simple:
- Create an account
- Register for the Hackathon
- Choose a local event
- Join a team or form your own
- Submit a project (before 11.59am on 5 October)
- Complete the Engagement Survey
We encourage you to be part of this global benchmark event where you will reuse open datasets. A great opportunity!
March is approaching and with it a new edition of the Open Data Day. It is an annual worldwide celebration that has been organised for 12 years, promoted by the Open Knowledge Foundation through the Open Knowledge Network. It aims to promote the use of open data in all countries and cultures.
This year's central theme is "Open data to address the polycrisis". The term polycrisis refers to a situation where different risks exist in the same time period. This theme aims to focus on open data as a tool to address, through its reuse, global challenges such as poverty and multiple inequalities, violence and conflict, climate risks and natural disasters.
While the activities used to be limited to a single day, since 2023 they have spanned a whole week of conferences, seminars, workshops, etc. centred on this theme. Specifically, in 2025, Open Data Day activities will take place from 1 to 7 March.
Through its website you can see the various activities that will take place throughout the week all over the world. In this article we review some of those that you can follow from Spain, either because they take place in Spain or because they can be followed online.
Open Data Day 2025: Women Leading Open Data for Equality
Iniciativa Barcelona Open Data is organising a session on the afternoon of 6 March focusing on how open data can help address equality challenges. The event will bring together women experts in data technologies and open data, to share knowledge, experiences and best practices in both the publication and reuse of open data in this field.
The event will start at 17:30 with a welcome and introduction. This will be followed by two panel discussions and an interview:
- Round Table 1. Publishing institutions. Gender-sensitive data strategy to address the feminist agenda.
- DIALOGUE Data lab. Building feminist Tech Data practice.
- Round Table 2. Re-users. Projects based on the use of open data to address the feminist agenda.
The day will end at 19:40 with a cocktail and the opportunity for attendees to discuss the topics covered and expand their network.
How can you follow the event? This is an in-person event, which will be held at Ca l'Alier, Carrer de Pere IV, 362 (Barcelona).
Open access scientific and scholarly publishing as a tool to face the 21st century polycrisis: the key role of publishers
Organised by a private individual, Professor Damián Molgaray, this conference looks at the key role of editors in open access scientific and scholarly publishing. The idea is for participants to reflect on how open knowledge is positioned as a fundamental tool to face the challenges of the 21st century polycrisis, with a focus on Latin America.
The event will take place on 4 March at 11:00 in Argentina (15:00 in mainland Spain).
How can you follow the event? This is an online event through Google Meet.
WhoFundsThem
The organisation mySociety will show the results of its latest project. Over the last few months, a team of volunteers has collected data on the financial interests of the 650 MPs in the UK House of Commons, using sources such as the official Register of Interests, Companies House, MPs' attendance at debates etc. This data, checked and verified with MPs themselves through a 'right of reply' system, has been transformed into an easily accessible format, so that anyone can easily understand it, and will be published on the parliamentary tracking website TheyWorkForYou.
At this event, the project will be presented and the conclusions will be discussed. It takes place on Tuesday 4 March at 14:00 London time (15:00 in mainland Spain).
How can you follow the event? The session can be followed online, but registration is required. The event will be in English.
Science on the 7th: A conversation on Open Data & Air Quality
A conference on open data and air quality will be available online on Friday 7 at 9:00 EST (15:00 in mainland Spain). The session will bring together a range of experts to discuss topical issues in air quality and global health, and will examine air pollution from key sources such as particulate matter, ozone and traffic-related pollution.
This initiative is organised by Health Effects Institute, a non-profit corporation that provides scientific data on the health effects of air pollution.
How can you follow the event? The conference, which will be in English, can be viewed on YouTube. No registration is required.
Deadline open for new event proposals
The above events are just a few examples of the activities that are part of this global celebration, but, as mentioned above, you can see all the actions on the initiative's website.
In addition, the deadline for registering new events is still open. If you have a proposal, you can register it via this link.
From datos.gob.es we invite you to join this week of celebration, which serves to vindicate the power of open data to generate positive changes in our society. Don't miss it!
Open data portals are an invaluable source of public information. However, extracting meaningful insights from this data can be challenging for users without advanced technical knowledge.
In this practical exercise, we will explore the development of a web application that democratizes access to this data through the use of artificial intelligence, allowing users to make queries in natural language.
The application, developed using the datos.gob.es portal as a data source, integrates modern technologies such as Streamlit for the user interface and Google's Gemini language model for natural language processing. Its modular design allows any artificial intelligence model to be used with minimal changes. The complete project is available in the GitHub repository.
Access the data laboratory repository on GitHub.
Run the data preprocessing code on Google Colab.
In this video, the author explains what you will find both on GitHub and Google Colab.
Application Architecture
The core of the application is based on four main interconnected sections that work together to process user queries:
- Context Generation
- Analyzes the characteristics of the chosen dataset.
- Generates a detailed description including dimensions, data types, and statistics.
- Creates a structured template with specific guidelines for code generation.
- Context and Query Combination
- Combines the generated context with the user's question, creating the prompt that the artificial intelligence model will receive.
- Response Generation
- Sends the prompt to the model and obtains the Python code that answers the user's question.
- Code Execution
- Safely executes the generated code with a retry and automatic correction system.
- Captures and displays the results in the application frontend.
Figure 1. Request processing flow
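The code execution step, with its retries and automatic correction, can be illustrated with a minimal sketch. This is not the application's actual implementation: `run_with_retries`, `generate_code` and `fake_model` are hypothetical names, the model call is replaced by a stand-in function, and plain `exec` is used only for illustration (it is not a sandbox, so a real deployment would need proper isolation).

```python
import contextlib
import io
import traceback
from typing import Callable


def run_with_retries(
    question: str,
    generate_code: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Generate code for a question, run it, and retry with the error on failure."""
    prompt = question
    last_error = ""
    for _ in range(max_attempts):
        code = generate_code(prompt)
        buffer = io.StringIO()
        try:
            # Capture anything the generated code prints.
            with contextlib.redirect_stdout(buffer):
                exec(code, {"__name__": "__generated__"})
            return buffer.getvalue()
        except Exception:
            last_error = traceback.format_exc(limit=1)
            # Feed the failing code and its error back so the model can correct it.
            prompt = (
                f"{question}\n\nThe previous code failed:\n{code}\n"
                f"Error:\n{last_error}\nReturn a corrected version."
            )
    raise RuntimeError(
        f"Code still failing after {max_attempts} attempts:\n{last_error}"
    )


def fake_model(prompt: str) -> str:
    # Stand-in for the language model: fails first, succeeds once it sees an error.
    if "Error:" in prompt:
        return "print(sum([1, 2, 3]) / 3)"
    return "print(undefined_variable)"


output = run_with_retries("What is the mean of 1, 2 and 3?", fake_model)
print(output)
```

Passing the model as a callable keeps the loop independent of any particular provider, which matches the modular design described above.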
Development Process
The first step is to establish a way to access public data. The datos.gob.es portal offers datasets via API. Functions have been developed to navigate the catalog and download these files efficiently.
Figure 2. datos.gob.es API
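As a rough sketch of catalog navigation, the snippet below builds a paged request and extracts the dataset entries from the response. The endpoint path, the `_page` and `_pageSize` parameters, and the `result.items` response layout are assumptions to be checked against the portal's API documentation, and the function names are hypothetical rather than those used in the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify it against the portal's API documentation.
BASE_URL = "https://datos.gob.es/apidata/catalog/dataset.json"


def catalog_url(page: int = 0, page_size: int = 10) -> str:
    """Build the URL for one page of the dataset catalog (assumed parameters)."""
    return f"{BASE_URL}?{urlencode({'_page': page, '_pageSize': page_size})}"


def extract_items(payload: dict) -> list:
    """Return the dataset entries, assumed to live under result.items."""
    return payload.get("result", {}).get("items", [])


def fetch_page(page: int = 0, page_size: int = 10) -> list:
    """Download one catalog page. Requires network access."""
    with urlopen(catalog_url(page, page_size), timeout=30) as response:
        return extract_items(json.load(response))


print(catalog_url(page=2, page_size=50))
```

Separating URL construction and response parsing from the network call makes both easy to check without contacting the portal.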
The second step addresses the question: how to convert natural language questions into useful data analysis? This is where Gemini, Google's language model, comes in. However, it's not enough to simply connect the model; it's necessary to teach it to understand the specific context of each dataset.
A three-layer system has been developed:
- A function that analyzes the dataset and generates a detailed "technical sheet".
- Another that combines this sheet with the user's question.
- And a third that translates all this into executable Python code.
The image below shows how this process unfolds, followed by the results of the generated code once it has been executed.
Figure 3. Visualization of the application's response processing
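The first two layers (the "technical sheet" and its combination with the question) could look roughly like the sketch below. It is an illustration under stated assumptions: the dataset is represented as a list of dictionaries to avoid external dependencies, the sample rows are invented, and `describe_dataset` and `build_prompt` are hypothetical names, not the functions in the repository.

```python
from statistics import mean


def describe_dataset(rows: list) -> str:
    """Build a short text sheet: dimensions, column types and basic statistics."""
    columns = list(rows[0]) if rows else []
    lines = [f"Rows: {len(rows)}, columns: {len(columns)}"]
    for column in columns:
        values = [row[column] for row in rows if row.get(column) is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            lines.append(
                f"- {column} (numeric): min={min(values)}, "
                f"max={max(values)}, mean={mean(values):.2f}"
            )
        else:
            distinct = len(set(map(str, values)))
            lines.append(f"- {column} (text): {distinct} distinct values")
    return "\n".join(lines)


def build_prompt(sheet: str, question: str) -> str:
    """Combine the technical sheet with the user's question into one prompt."""
    return (
        "You are given a dataset loaded in a variable named `df`.\n"
        f"Dataset description:\n{sheet}\n\n"
        f"Question: {question}\n"
        "Answer with Python code only."
    )


# Invented sample rows, for illustration only.
rows = [
    {"model": "Van", "registration_year": 2005},
    {"model": "Truck", "registration_year": 2011},
    {"model": "Van", "registration_year": 2016},
]
sheet = describe_dataset(rows)
prompt = build_prompt(sheet, "What is the average age of the vehicle fleet?")
print(prompt)
```

The third layer, turning the prompt into executable code, is the call to the language model itself and is therefore not shown here.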
Finally, with Streamlit, a web interface has been built that shows the process and its results to the user. The interface is as simple as choosing a dataset and asking a question, but also powerful enough to display complex visualizations and allow data exploration.
The final result is an application that allows anyone, regardless of their technical knowledge, to perform data analysis and learn about the code executed by the model. For example, a municipal official can ask "What is the average age of the vehicle fleet?" and get a clear visualization of the age distribution.
Figure 4. Complete use case. Visualizing the distribution of registration years of the municipal vehicle fleet of Almendralejo in 2018
What Can You Learn?
This practical exercise allows you to learn:
- AI Integration in Web Applications:
- How to communicate effectively with language models like Gemini.
- Techniques for structuring prompts that generate precise code.
- Strategies for safely handling and executing AI-generated code.
- Web Development with Streamlit:
- Creating interactive interfaces in Python.
- Managing state and sessions in web applications.
- Implementing visual components for data.
- Working with Open Data:
- Connecting to and consuming public data APIs.
- Processing Excel files and DataFrames.
- Data visualization techniques.
- Development Best Practices:
- Modular structuring of Python code.
- Error handling and retries.
- Implementation of visual feedback systems.
- Web application deployment using ngrok.
Conclusions and Future
This exercise demonstrates the extraordinary potential of artificial intelligence as a bridge between public data and end users. Through the practical case developed, we have been able to observe how the combination of advanced language models with intuitive interfaces allows us to democratize access to data analysis, transforming natural language queries into meaningful analysis and informative visualizations.
For those interested in expanding the system's capabilities, there are multiple promising directions for its evolution:
- Incorporation of more advanced language models that allow for more sophisticated analysis.
- Implementation of learning systems that improve responses based on user feedback.
- Integration with more open data sources and diverse formats.
- Development of predictive and prescriptive analysis capabilities.
In summary, this exercise not only demonstrates the feasibility of democratizing data analysis through artificial intelligence, but also points to a promising path toward a future where access to and analysis of public data is truly universal. The combination of modern technologies such as Streamlit, language models, and visualization techniques opens up a range of possibilities for organizations and citizens to make the most of the value of open data.
Promoting the data culture is a key objective at the national level that is also shared by the regional administrations. One of the ways to achieve this purpose is to award those solutions that have been developed with open datasets, an initiative that enhances their reuse and impact on society.
On this mission, the Junta de Castilla y León and the Basque Government have been organising open data competitions for years, a subject we talked about in our first episode of the datos.gob.es podcast that you can listen to here.
In this post, we take a look at the winning projects in the latest editions of the open data competitions in the Basque Country and Castilla y León.
Winners of the 8th Castilla y León Open Data Competition
In the eighth edition of this annual competition, which usually opens at the end of summer, 35 entries were submitted, from which 8 winners were chosen in different categories.
Ideas category: participants had to describe an idea to create studies, services, websites or applications for mobile devices. A first prize of €1,500 and a second prize of €500 were awarded.
- First prize: Green Guardians of Castilla y León presented by Sergio José Ruiz Sainz. This is a proposal to develop a mobile application to guide visitors to the natural parks of Castilla y León. Users can access information (such as interactive maps with points of interest) as well as contribute useful data from their visit, which enriches the application.
- Second prize: ParkNature: intelligent parking management system in natural spaces presented by Víctor Manuel Gutiérrez Martín. It consists of an idea to create an application that optimises the experience of visitors to the natural areas of Castilla y León, by integrating real-time data on parking and connecting with nearby cultural and tourist events.
Products and Services category: awarded to studies, services, websites or applications for mobile devices, which had to be accessible to all citizens on the web through a URL. In this category, first, second and third prizes of €2,500, €1,500 and €500 respectively were awarded, as well as a specific prize of €1,500 for students.
- First prize: AquaCyL from Pablo Varela Vázquez. It is an application that provides information about the bathing areas in the autonomous community.
- Second prize: ConquistaCyL presented by Markel Juaristi Mendarozketa and Maite del Corte Sanz. It is an interactive game designed for tourism in Castilla y León and learning through a gamified process.
- Third prize: All the sport of Castilla y León presented by Laura Folgado Galache. It is an app that presents all the information of interest associated with a sport according to the province.
- Student prize: Otto Wunderlich en Segovia by Jorge Martín Arévalo. It is a photographic repository sorted according to type of monuments and location of Otto Wunderlich's photographs.
Didactic Resource Category: consisted of the creation of new and innovative open didactic resources to support classroom teaching. These resources were to be published under Creative Commons licences. A single first prize of €1,500 was awarded in this category.
- First prize: StartUp CyL: Business creation through Artificial Intelligence and Open Data presented by José María Pérez Ramos. It is a chatbot that uses the ChatGPT API to assist in setting up a business using open data.
Data Journalism category: awarded to journalistic pieces that were published or substantially updated, in either written or audiovisual media, with a prize of €1,500.
- First prize: Codorniz, perdiz y paloma torcaz son las especies más cazadas en Burgos, presented by Sara Sendino Cantera, which analyses data on hunting in Burgos.
Winners of the 5th edition of the Open Data Euskadi competition
As in previous editions, the Basque open data portal ran two competitions, one for ideas and one for applications, each divided into several categories. On this occasion, 41 entries were submitted to the ideas competition and 30 to the applications competition.
Ideas competition: two prizes, of €3,000 and €1,500, were awarded in each category.
Health and Social Category
- First prize: Development of a Model for Predicting the Volume of Patients attending the Emergency Department of Osakidetza by Miren Bacete Martínez. It proposes the development of a predictive model using time series capable of anticipating both the volume of people attending the emergency department and the level of severity of cases.
- Second prize: Euskoeduca by Sandra García Arias. It is a proposed digital solution designed to provide personalised academic and career guidance to students, parents and guardians.
Environment and Sustainability Category
- First prize: Baratzapp by Leire Zubizarreta Barrenetxea. The idea is to develop software that assists in planning a vegetable garden, using algorithms that build on knowledge about growing vegetables for self-consumption while integrating climatological, environmental and plot information, among other data, in a way that is personalised for the user.
- Second prize: Euskal Advice by Javier Carpintero Ordoñez. The aim of this proposal is to define a tourism recommender based on artificial intelligence.
General Category
- First prize: Lanbila by Hodei Gonçalves Barkaiztegi. It is a proposed app that uses generative AI and open data to match curricula vitae with job offers semantically. It provides personalised recommendations, proactive employment and training alerts, and enables informed decisions through labour and territorial indicators.
- Second prize: Development of an LLM for the interactive consultation of Open Data of the Basque Government by Ibai Alberdi Martín. The proposal consists of developing a large language model (LLM) similar to ChatGPT, specifically trained with open data, focused on providing a conversational and graphical interface that allows users to get accurate answers and dynamic visualisations.
Applications competition: one project was selected in the Web Services Category, awarded €8,000, and two more in the General Category, which received a first prize of €8,000 and a second prize of €5,000.
Web Services Category
- First prize: Bizidata: Plataforma de visualización del uso de bicicletas en Vitoria-Gasteiz by Igor Díaz de Guereñu de los Ríos. It is a platform for visualising, analysing and downloading data on bicycle use in Vitoria-Gasteiz, and for exploring how external factors, such as the weather and traffic, influence bicycle use.
General Category
- First prize: Garbiñe AI by Beatriz Arenal Redondo. It is an intelligent assistant that combines Artificial Intelligence (AI) with open data from Open Data Euskadi to promote the circular economy and improve recycling rates in the Basque Country.
- Second prize: Vitoria-Gasteiz Businessmap by Zaira Gil Ozaeta. It is an interactive visualisation tool based on open data, designed to improve strategic decisions in the field of entrepreneurship and economic activity in Vitoria-Gasteiz.
All these award-winning solutions reuse open datasets from the regional portal of Castilla y León or Euskadi, as the case may be. We encourage you to take a look at the proposals that may inspire you to participate in the next edition of these competitions. Follow us on social media so you don't miss out on this year's calls!
ASEDIE, Asociación Multisectorial de la Información, will hold its usual International Conference on the Reuse of Public Sector Information on December 12. This will be its 16th edition and the central theme is "ASEDIE, 25 years driving the data economy". The aim of the meeting is to address the progress made during this time, provide a snapshot of the current situation and discuss barriers and possible solutions for the re-use of public sector information.
When and where does it take place?
The event will be held in person on 12 December 2024 at the National Statistics Institute (INE), located at Avenida de Manoteras 52, in Madrid. Seating is limited. Reception will start at 9:00 and the event will end at 13:40. To attend the event you must register at this link.
What is the programme?
The focus of this edition will be on the reuse of public sector information and on commemorating the 25 years that the ASEDIE Association has been promoting the data economy in Spain.
The session will open at 9:30 a.m. with a welcome to attendees from the President of ASEDIE, Ignacio Jiménez, and the President of INE, Elena Manzanera.
The event will feature three round tables:
- The first round table will take place from 9:45 to 10:30 and will deal with 'Artificial Intelligence and data protection coexisting with reuse'. It will feature Miguel Valle del Olmo, Digital Transformation Advisor of the Permanent Representation of Spain to the European Union, and Leonardo Cervera Navas, Secretary General of the European Data Protection Supervisor, and will be moderated by Valentín Arce, Vice-president of ASEDIE.
At the end of this thematic block, the ASEDIE 2024 Award will be presented to recognize those individuals, companies or institutions that stand out for the best work or the greatest contribution to innovation and development of the Infomediary sector in the current year.
After a coffee break, the second round table will start at 11:30:
- This second round table, titled "Leadership in open data", will bring together leading figures from the public sector to highlight their coordinating role. It will feature Carmen Cabanilla, Director General of Public Governance of the Secretary of State for Public Function; Ruth del Campo, General Data Director; and Francisco Javier García Vieira, Director of RedIRIS and Digital Public Services of Red.es. It will be moderated by Manuel Suarez, Member of the Board of Directors of ASEDIE.
- The third round table on "The reality of open data: quality, governance and access" will start at 12:30 and will be moderated by Carmen de Pablo, Professor at the Universidad Rey Juan Carlos. This round table will be attended by Fernando Serrano, Advisor to the General Directorate of Cadastre; Joseba Asiain, Director General of the Presidency, Open Government and Relations with the Parliament of the Government of Navarre and Ángela Perez, Director General of Transparency and Quality of the Madrid City Council.
Finally, the event will end with a brief closing speech by Ignacio Jiménez, president of ASEDIE.
You can consult the complete program here.
How can I register?
Attendance is in person with limited seating and registrations can be made on the ASEDIE website.
Open data portals play a fundamental role in accessing and reusing public information. A key aspect in these environments is the tagging of datasets, which facilitates their organization and retrieval.
Word embeddings represent a transformative technology in the field of natural language processing, allowing words to be represented as vectors in a multidimensional space where semantic relationships are mathematically preserved. This exercise explores their practical application in a tag recommendation system, using the datos.gob.es open data portal as a case study.
The exercise is developed in a notebook that integrates the environment configuration, data acquisition, and recommendation system processing, all implemented in Python. The complete project is available in the Github repository.
Access the data lab repository on GitHub.
Run the data preprocessing code on Google Colab.
In this video, the author explains what you will find both on Github and Google Colab (English subtitles available).
Understanding word embeddings
Word embeddings are numerical representations of words that revolutionize natural language processing by transforming text into a mathematically processable format. This technique encodes each word as a numerical vector in a multidimensional space, where the relative position between vectors reflects semantic and syntactic relationships between words. The true power of embeddings lies in three fundamental aspects:
- Context capture: unlike traditional techniques such as one-hot encoding, embeddings learn from the context in which words appear, allowing them to capture meaning nuances.
- Semantic algebra: the resulting vectors allow mathematical operations that preserve semantic relationships. For example, vector('Madrid') - vector('Spain') + vector('France') ≈ vector('Paris'), demonstrating the capture of capital-country relationships.
- Quantifiable similarity: similarity between words can be measured through metrics, allowing identification of not only exact synonyms but also terms related in different degrees and generalizing these relationships to new word combinations.
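The semantic algebra and similarity ideas above can be sketched with a few hand-made vectors. The three-dimensional values below are invented purely for illustration (real embeddings have many more dimensions and are learned from text), and `cosine_similarity` and `analogy` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words so the answer is a new word.
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


# Invented toy vectors: axes loosely stand for "Spain", "France", "is a capital".
vectors = {
    "Spain": [1.0, 0.0, 0.0],
    "France": [0.0, 1.0, 0.0],
    "Madrid": [1.0, 0.0, 1.0],
    "Paris": [0.0, 1.0, 1.0],
    "Rome": [0.0, 0.0, 1.0],
}

result = analogy(vectors, "Madrid", "Spain", "France")
print(result)
```

With these toy values the target vector coincides with the one assigned to "Paris", so it is selected over "Rome"; with real embeddings the match is only approximate.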
In this exercise, pre-trained GloVe (Global Vectors for Word Representation) embeddings were used, a model developed by Stanford that stands out for its ability to capture global semantic relationships in text. In our case, we use 50-dimensional vectors, a balance between computational complexity and semantic richness. To comprehensively evaluate their ability to represent the Spanish language, multiple tests were conducted:
- Word similarity was analyzed using cosine similarity, a metric that evaluates the angle between two word vectors. This measure results in values between -1 and 1, where values close to 1 indicate high semantic similarity, while values close to 0 indicate little or no relationship. Terms like "amor" (love), "trabajo" (work), and "familia" (family) were evaluated to verify that the model correctly identified semantically related words.
- The model's ability to solve linguistic analogies was tested, for example, "hombre es a mujer lo que rey es a reina" (Man is to woman what king is to queen), confirming its ability to capture complex semantic relationships.
- Vector operations were performed (such as "rey - hombre + mujer") to check if the results maintained semantic coherence.
- Finally, dimensionality reduction techniques were applied to a representative sample of 40 Spanish words, allowing visualization of semantic relationships in a two-dimensional space. The results revealed natural grouping patterns among semantically related terms, as observed in the figure:
- Emotions: alegría (joy), felicidad (happiness) or pasión (passion) appear grouped in the upper right.
- Family-related terms: padre (father), hermano (brother) or abuelo (grandfather) concentrate at the bottom.
- Transport: coche (car), autobús (bus), or camión (truck) form a distinctive group.
- Colors: azul (blue), verde (green) or rojo (red) appear close to each other.
Figure 1. Principal component analysis of the 50-dimensional embeddings. The two components account for an explained variance ratio of 0.46
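The similarity tests can be reproduced in miniature. The sketch below parses the plain-text GloVe format (one word per line followed by its vector components) and ranks neighbours by cosine similarity. The sample lines are invented three-dimensional vectors, not real GloVe values, and `load_glove` and `most_similar` are hypothetical names rather than the notebook's functions.

```python
import io
import math
from typing import TextIO


def load_glove(handle: TextIO) -> dict:
    """Parse GloVe text format: a word followed by space-separated floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) > 2:
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar(embeddings, word, top_n=3):
    """Rank all other words by cosine similarity to the given word."""
    query = embeddings[word]
    scores = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented sample in GloVe layout; a real file would be opened with
# open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    "padre 0.9 0.1 0.0\n"
    "hermano 0.8 0.2 0.1\n"
    "coche 0.0 0.1 0.9\n"
    "autobús 0.1 0.0 0.8\n"
)
embeddings = load_glove(sample)
neighbours = most_similar(embeddings, "padre", top_n=2)
print(neighbours)
```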
To systematize this evaluation process, a unified function has been developed that encapsulates all the tests described above. This modular architecture allows automatic and reproducible evaluation of different pre-trained embedding models, thus facilitating objective comparison of their performance in Spanish language processing. The standardization of these tests not only optimizes the evaluation process but also establishes a consistent framework for future comparisons and validations of new models by the public.
This ability to capture semantic relationships in Spanish is what we leverage in our tag recommendation system.
Embedding-based Recommendation System
Leveraging the properties of embeddings, we developed a tag recommendation system that follows a three-phase process:
- Embedding generation: for each dataset in the portal, we generate a vector representation combining the title and description. This allows us to compare datasets by their semantic similarity.
- Similar dataset identification: using cosine similarity between vectors, we identify the most semantically similar datasets.
- Tag extraction and standardization: from the similar datasets, we extract their associated tags and map them to terms in the Eurovoc thesaurus. This thesaurus, developed by the European Union, is a multilingual controlled vocabulary that provides standardized terminology for cataloging documents and data in the field of European policies. Again leveraging embeddings, we identify the Eurovoc terms that are semantically closest to our tags, thus ensuring coherent standardization and better interoperability between European information systems.
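The three phases can be sketched end to end on toy data. Several things here are assumptions rather than the project's method: texts are embedded by averaging the vectors of their known words (one simple option for combining title and description), the two-dimensional vectors and the catalog are invented, the thesaurus list is a stand-in for Eurovoc, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed_text(text, embeddings):
    """Average the vectors of the known words in a text (assumed strategy)."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recommend_tags(query_text, catalog, embeddings, top_k=2):
    """Collect tags from the top_k most similar datasets, most frequent first."""
    query = embed_text(query_text, embeddings)
    scored = []
    for dataset in catalog:
        vector = embed_text(f"{dataset['title']} {dataset['description']}", embeddings)
        if query is not None and vector is not None:
            scored.append((cosine(query, vector), dataset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    counts = Counter(tag for _, d in scored[:top_k] for tag in d["tags"])
    return [tag for tag, _ in counts.most_common()]


def closest_term(tag, terms, embeddings):
    """Return the thesaurus term whose embedding is closest to the tag's."""
    tag_vector = embed_text(tag, embeddings)
    best, best_score = None, -2.0
    for term in terms:
        vector = embed_text(term, embeddings)
        if tag_vector is not None and vector is not None:
            score = cosine(tag_vector, vector)
            if score > best_score:
                best, best_score = term, score
    return best


# Invented 2-d vectors: first axis ~ culture, second axis ~ public finance.
embeddings = {
    "events": [1.0, 0.0], "agenda": [0.9, 0.1], "culture": [0.95, 0.05],
    "cultural": [0.9, 0.1], "theater": [0.8, 0.2], "promotion": [0.7, 0.3],
    "budget": [0.0, 1.0], "spending": [0.1, 0.9], "policy": [0.2, 0.8],
}
catalog = [
    {"title": "Cultural agenda", "description": "theater events", "tags": ["culture", "theater"]},
    {"title": "Events agenda", "description": "cultural events", "tags": ["culture"]},
    {"title": "Budget", "description": "spending", "tags": ["budget"]},
]
thesaurus_terms = ["cultural promotion", "budget policy"]  # stand-in for Eurovoc

tags = recommend_tags("agenda cultural events", catalog, embeddings, top_k=2)
standardized = {tag: closest_term(tag, thesaurus_terms, embeddings) for tag in tags}
print(tags, standardized)
```

Reusing the same embedding and similarity functions for both dataset matching and thesaurus mapping mirrors the point made above: the same principle is applied twice.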
The results show that the system can generate coherent and standardized tag recommendations. To illustrate the system's operation, let's take the case of the dataset "Tarragona City Activities Agenda":
Figure 2. Tarragona City Events Guide
The system:
- Finds similar datasets like "Terrassa Activities Agenda" and "Barcelona Cultural Agenda".
- Identifies common tags from these datasets, such as "EXHIBITIONS", "THEATER", and "CULTURE".
- Suggests related Eurovoc terms: "cultural tourism", "cultural promotion", and "cultural industry".
Advantages of the Approach
This approach offers significant advantages:
- Contextual Recommendations: the system suggests tags based on the real meaning of the content, not just textual matches.
- Automatic Standardization: integration with Eurovoc ensures a controlled and coherent vocabulary.
- Continuous Improvement: recommendations improve as new datasets are added, because the pool of tagged datasets available for comparison grows.
- Interoperability: the use of Eurovoc facilitates integration with other European systems.
Conclusions
This exercise demonstrates the great potential of embeddings as a tool for associating texts based on their semantic nature. Through the analyzed practical case, it has been possible to observe how, by identifying similar titles and descriptions between datasets, precise recommendations of tags or keywords can be generated. These tags, in turn, can be linked with keywords from a standardized thesaurus like Eurovoc, applying the same principle.
Despite the challenges that may arise, implementing these types of systems in production environments presents a valuable opportunity to improve information organization and retrieval. The accuracy in tag assignment can be influenced by various interrelated factors in the process:
- The specificity of dataset titles and descriptions is fundamental, as correct identification of similar content and, therefore, adequate tag recommendation depends on it.
- The quality and representativeness of existing tags in similar datasets directly determines the relevance of generated recommendations.
- The thematic coverage of the Eurovoc thesaurus, which, although extensive, may not cover specific terms needed to describe certain datasets precisely.
- The vectors' capacity to faithfully capture semantic relationships between content, which directly impacts the precision of assigned tags.
For those who wish to delve deeper into the topic, there are other interesting approaches to using embeddings that complement what we've seen in this exercise, such as:
- Using more complex and computationally expensive embedding models (like BERT, GPT, etc.)
- Training models on a custom domain-adapted corpus.
- Applying deeper data cleaning techniques.
In summary, this exercise not only demonstrates the effectiveness of embeddings for tag recommendation but also invites readers to explore the further possibilities this powerful tool offers.