News

The reuse of open data makes it possible to generate innovative solutions that improve people's lives, boost citizen participation and strengthen public transparency. Proof of this can be found in the competitions promoted this year by the Junta de Castilla y León and the Madrid City Council.

With the ninth edition of the Castilla y León competition and the first edition of the Madrid competition, both administrations have presented their prizes to the selected projects, recognising students, startups, professionals and researchers who have been able to transform public data into useful tools and knowledge. In this post, we review the award-winning projects in each competition and the context that drives them.

Castilla y León: ninth edition of consolidated awards in a more open administration

At the awards ceremony of the ninth Open Data Contest of the Junta de Castilla y León, several advances were highlighted: a 65% budget increase for the General Directorate of Transparency and Good Governance, the expansion of proactively published content, and continuous improvement of the right of access to public information, which has reduced both requests and rejection resolutions. The Open Data Portal of Castilla y León offers 776 datasets that enable the development of services, applications and studies each year.

The Open Data Awards recognize initiatives in four categories: Ideas; Products and Services; Teaching Resources; and Data Journalism.

Ideas

  • First prize: CyL Rural Hub. Proposal to develop a comprehensive platform for the rural territory that centralises services, infrastructures, job opportunities and educational offer. Its objective is to provide families and professionals with useful information to plan a life project in the villages of the community.

  • Second prize: Cultural App of Castilla y León. An idea aimed at boosting cultural activity through an application that centralises events, activities and locations, also offering an intuitive and close experience based on open data.

Products & Services

  • First prize: CyL Bridge. Application designed to support the integration of migrants through personalized routes, an artificial intelligence assistant and a resource center powered by public data.

  • Second prize: MuniCyL. A tool that brings together dispersed municipal information and presents it on a single clear, accessible and up-to-date platform.

  • Third prize: Interactive map of Natural Spaces. A resource that allows citizens to explore the protected areas of the territory dynamically and in real time.

  • Student awards: Info Salamanca. Platform that offers interactive maps, thematic filters and a conversational assistant to bring provincial information closer and facilitate the consultation of citizen data.

Teaching Resource

Data Journalism

Madrid: first edition of awards that promote reuse in the urban environment

On the other hand, the Madrid City Council has held the first edition of the Open Data Reuse Awards 2025. The ceremony highlighted the quality and diversity of the 65 applications submitted, many of them driven by university students and startups.

The awards seek to promote the use of data  from the Madrid City Council's Open Data Portal, support the creation of services and studies that contribute to knowledge of the city and reinforce the role of the city council as a benchmark administration in transparency and accountability.

In this case, the awards are structured into four categories: Web Services and Applications, Visualizations, Studies and Ideas, and Portal Improvement.

Web Services & Applications

  • First prize: Madriwa. Find your place in Madrid. A tool that facilitates the search for housing through data on neighbourhoods, services and prices, allowing an informed and simplified comparison.

  • Second prize: The guardians of the air. An application developed by Tangible Data to check the city's air quality, especially designed to raise awareness among young people and educational centers.

Visualizations

  • First prize: Ramp. Routes for people with reduced mobility. It presents accessible itineraries based on geospatial and orography data, offering alternative routes adapted to people with reduced mobility.

  • Second prize: AccesibiliMad. It shows public services available in each urban environment, with special attention to the specific needs of different groups.

Studies, Research and Ideas

  • First prize: Fifteen-minute cities for children. Analysis of the availability of essential services for minors within a maximum radius of 15 minutes, providing an innovative vision of urban planning.

  • Second prize: The impact of tourism in urban areas. This study delves into the relationship between tourist housing, the commercial fabric and labour dynamics, using urban and socio-economic data.

Improving Portal Quality

  • First prize: Your Open Data. Improving harvesting in data.europa.eu. Proposal that improves the way data is provided, raising the quality of metadata and boosting European interoperability.

  • Second prize: Discovery, observability and intelligent governance of open data. Solution that introduces an automated layer of intelligence and control over the municipal catalog.

Both Castilla y León, with a consolidated track record, and the Madrid City Council, which is inaugurating its own awards, contribute decisively to strengthening the Spanish open data ecosystem. Their calls are an example of how collaboration between administrations, citizens, academia and the private sector can transform public data into knowledge, participation and innovation at the service of society as a whole.

Documentation

In the public sector ecosystem, subsidies represent one of the most important mechanisms for promoting projects, companies and activities of general interest. However, understanding how these funds are distributed, which agencies call for the largest grants or how the budget varies according to the region or beneficiaries is not trivial when working with hundreds of thousands of records.

In this line, we present a new practical exercise in the series "Step-by-step data exercises", in which we will learn how to explore and model open data using Apache Spark, one of the most widespread platforms for distributed processing and large-scale machine learning.

In this laboratory we will work with real data from the National System for the Publicity of Subsidies and Public Aid (BDNS) and we will build a model capable of predicting the budget range of new calls based on their main characteristics.

All the code used is available in the corresponding GitHub repository so that you can run it, understand it, and adapt it to your own projects.

Link to GitHub

Context: why analyse public subsidies?

The BDNS collects detailed information on hundreds of thousands of calls published by different Spanish administrations: from ministries and regional ministries to provincial councils and city councils. This dataset is an extraordinarily valuable source for:

  • analyse the evolution of public spending,

  • understand which bodies are most active in certain areas,

  • identify patterns in the types of beneficiaries,

  • and study the budget distribution according to sector or territory.

In our case, we will use the dataset to address a very specific question, but of great practical interest:

Can we predict the budget range of a call based on its administrative characteristics?

This capability would facilitate initial classification, decision-making support or comparative analysis within a public administration.

Objective of the exercise

The objective of the laboratory is twofold:

  1. Learn how to use Spark in a practical way:

  • Load a real high-volume dataset

  • Perform transformations and cleaning

  • Manipulate categorical and numeric columns

  • Structure a machine learning pipeline

  2. Build a predictive model

We will train a classifier capable of estimating whether a call belongs to a low (up to €20k), medium (between €20k and €150k) or high (more than €150k) budget range, based on variables such as those listed below (a short labelling sketch follows the list):

  • Granting body

  • Autonomous community

  • Type of beneficiary

  • Year of publication

  • Administrative descriptions
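As a hedged illustration of how these ranges could be turned into a training label, the following PySpark sketch derives a budget class from the amount of each call. The column name "importe" and the Parquet path are assumptions for illustration; the notebook in the repository uses the actual BDNS field names.

```python
# Hedged sketch: derive the low / medium / high budget label from the amount of
# each call. "importe" and the Parquet path are assumed names, not the exact
# ones used in the notebook.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bdns-labels").getOrCreate()
df = spark.read.parquet("bdns_calls_*.parquet")   # assumed output of the download step

df = df.withColumn(
    "budget_range",
    F.when(F.col("importe") <= 20_000, "low")
     .when(F.col("importe") <= 150_000, "medium")
     .otherwise("high"))

df.groupBy("budget_range").count().show()
```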

Resources used

To complete this exercise we use:

Analytical tools

  • Python, the main language of the project
  • Google Colab, to run Spark and create Notebooks in a simple way
  • PySpark, for data processing in the cleaning and modeling stages
  • Pandas, for small auxiliary operations
  • Plotly, for some interactive visualizations

Data

Official dataset of the National System for the Publicity of Subsidies and Public Aid (BDNS), downloaded from the subsidies portal of the Ministry of Finance.

The data used in this exercise were downloaded on August 28, 2025. The reuse of data from the National System for the Publicity of Subsidies and Public Aid is subject to the legal conditions set out in https://www.infosubvenciones.es/bdnstrans/GE/es/avisolegal.

Development of the exercise

The project is divided into several phases, following the natural flow of a real Data Science case.

5.1. Data Dump and Transformation

In this first section we are going to automatically download the subsidy dataset from the API of the portal of the National System for the Publicity of Subsidies and Public Aid (BDNS). We will then transform the data into an optimized format such as Parquet (a columnar data format) to facilitate its exploration and analysis.

In this process we will use some complex concepts, such as:

  • Asynchronous functions: they allow two or more independent operations to be processed concurrently, which makes the overall process more efficient.

  • Rotating writer: when a limit on the amount of information is exceeded, the file being written is closed and a new one is opened with an auto-incremented index. This avoids files that are too large and improves efficiency (see the sketch below).
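To make these two concepts more concrete, here is a minimal, hedged Python sketch combining asynchronous downloads with a rotating Parquet writer. The endpoint URL, pagination parameters and response fields are hypothetical placeholders rather than the real BDNS API; the repository notebook contains the actual implementation.

```python
# Hedged sketch: asynchronous page downloads plus a rotating Parquet writer.
# The endpoint URL, pagination parameters and response fields are hypothetical
# placeholders, not the real BDNS API. Requires aiohttp, pandas and pyarrow.
import asyncio

import aiohttp
import pandas as pd

API_URL = "https://example.org/bdns/api/calls"   # hypothetical endpoint
ROWS_PER_FILE = 100_000                          # rotation threshold


class RotatingParquetWriter:
    """Buffers records and writes a new Parquet file each time a row limit is hit."""

    def __init__(self, prefix: str, max_rows: int):
        self.prefix, self.max_rows = prefix, max_rows
        self.buffer, self.index = [], 0

    def add(self, records: list[dict]) -> None:
        self.buffer.extend(records)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        path = f"{self.prefix}_{self.index:04d}.parquet"   # auto-incremented index
        pd.DataFrame(self.buffer).to_parquet(path, index=False)
        self.buffer, self.index = [], self.index + 1


async def fetch_page(session: aiohttp.ClientSession, page: int) -> list[dict]:
    # Each page is requested independently, so several can be in flight at once.
    async with session.get(API_URL, params={"page": page}) as resp:
        resp.raise_for_status()
        payload = await resp.json()
        return payload.get("content", [])


async def download(n_pages: int) -> None:
    writer = RotatingParquetWriter("bdns_calls", ROWS_PER_FILE)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, p) for p in range(n_pages)]
        for coro in asyncio.as_completed(tasks):
            writer.add(await coro)
    writer.flush()  # write whatever remains in the buffer


if __name__ == "__main__":
    asyncio.run(download(n_pages=10))
```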

5.2. Exploratory analysis

The aim of this phase is to get a first idea of the characteristics of the data and its quality.

We will analyze, among others, aspects such as:

  • Which types of subsidies have the highest number of calls.


  • What is the distribution of subsidies according to their purpose (e.g. culture, education, promotion of employment...).
  • Which purposes account for the largest budget volume (both aggregations are sketched below).
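As a hedged PySpark sketch of this kind of exploratory aggregation, the snippet below counts calls per purpose and sums the budget per purpose. The column names "finalidad" (purpose) and "importe" (amount) are assumptions for illustration.

```python
# Hedged sketch of the exploratory aggregations: calls per purpose and total
# budget per purpose. "finalidad" and "importe" are assumed column names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bdns-eda").getOrCreate()
df = spark.read.parquet("bdns_calls_*.parquet")   # assumed output of the previous step

# Number of calls per purpose.
df.groupBy("finalidad").count().orderBy(F.desc("count")).show(10, truncate=False)

# Total budget per purpose.
(df.groupBy("finalidad")
   .agg(F.sum("importe").alias("total_budget"))
   .orderBy(F.desc("total_budget"))
   .show(10, truncate=False))
```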

5.3. Modelling: construction of the budget classifier

At this point, we enter the most analytical part of the exercise: teaching a machine to predict whether a new call will have a low, medium or high budget based on its administrative characteristics. To achieve this, we designed a  complete machine learning pipeline in Spark that allows us to transform the data, train the model, and evaluate it in a uniform and reproducible way.

First, we prepare all the variables – many of them categorical, such as the convening body – so that the model can interpret them. We then combine all that information into a single vector that serves as the starting point for the learning phase.

With that foundation built, we train a classification model that learns to distinguish subtle patterns in the data: which agencies tend to publish larger calls or how specific administrative elements influence the size of a grant.

Once trained, we analyze its performance from different angles. We evaluate its ability to correctly classify the three budget ranges and analyze its behavior using metrics such as accuracy or the confusion matrix.
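The following is a minimal sketch, with assumed column names, of how such a pipeline can be assembled and evaluated with Spark ML; the full, working version lives in the repository notebook.

```python
# Minimal sketch of the kind of Spark ML pipeline described above. Column names
# (organo, comunidad, tipo_beneficiario, anio, budget_range) are assumptions for
# illustration; the repository notebook uses the real BDNS fields.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("bdns-budget-classifier").getOrCreate()
df = spark.read.parquet("bdns_calls_labelled.parquet")   # assumed labelled input

categorical = ["organo", "comunidad", "tipo_beneficiario"]
stages = []
for col in categorical:
    # Categorical variables are indexed and one-hot encoded so the model can use them.
    stages += [
        StringIndexer(inputCol=col, outputCol=f"{col}_idx", handleInvalid="keep"),
        OneHotEncoder(inputCol=f"{col}_idx", outputCol=f"{col}_vec"),
    ]

# The target (low / medium / high budget range) is indexed into a numeric label.
stages.append(StringIndexer(inputCol="budget_range", outputCol="label"))

# All features are combined into a single vector, the input of the classifier.
stages.append(VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical] + ["anio"],
    outputCol="features"))

stages.append(RandomForestClassifier(featuresCol="features", labelCol="label"))

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train)

predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
).evaluate(predictions)
print(f"Accuracy on the test set: {accuracy:.3f}")
```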


But we do not stop there: we also study which variables have had the greatest weight in the decisions of the model, which allows us to understand which factors seem most decisive when it comes to anticipating the budget of a call.

Conclusions of the exercise

This laboratory will allow us to see how Spark simplifies the processing and modelling of high-volume data, especially useful in environments where administrations generate thousands of records per year, and to better understand the subsidy system after analysing some key aspects of the organisation of these calls.

Do you want to do the exercise?

If you're interested in learning more about using Spark and advanced public data analysis, you can access the repository and run the full Notebook step-by-step.

Link to Notebook

Content created by Juan Benavente, senior industrial engineer and expert in technologies related to the data economy. The content and views expressed in this publication are the sole responsibility of the author.

Blog

We live in an age where more and more phenomena in the physical world can be observed, measured, and analyzed in real time. The temperature of a crop, the air quality of a city, the state of a dam, the flow of traffic or the energy consumption of a building are no longer data that are occasionally reviewed: they are continuous flows of information that are generated second by second.

This revolution would not be possible without cyber-physical systems (CPS), a technology that integrates sensors, algorithms and actuators to connect the physical world with the digital world. But CPS does not only generate data: it can also be fed by open data, multiplying its usefulness and enabling evidence-based decisions.

In this article, we will explore what CPS is, how it generates massive data in real time, what challenges it poses to turn that data into useful public information, what principles are essential to ensure its quality and traceability, and what real-world examples demonstrate the potential for its reuse. We will close with a reflection on the impact of this combination on innovation, citizen science and the design of smarter public policies.

What are cyber-physical systems?

A cyber-physical system is a tight integration between digital components – such as software, algorithms, communication and storage – and physical components – sensors, actuators, IoT devices or industrial machines. Its main function is to observe the environment, process information and act on it.

Unlike traditional monitoring systems, a CPS is not limited to measuring: it closes a complete loop between perception, decision, and action. This cycle can be understood through three main elements:


Figure 1. Cyber-physical systems cycle. Source: own elaboration

An everyday example that illustrates this complete cycle of perception, decision and action very well is smart irrigation, which is increasingly present in precision agriculture and home gardening systems. In this case, sensors distributed throughout the terrain continuously measure soil moisture, ambient temperature, and even solar radiation. All this information flows to the computing unit, which analyzes the data, compares it with previously defined thresholds or with more complex models – for example, those that estimate the evaporation of water or the water needs of each type of plant – and determines whether irrigation is really necessary.

When the system concludes that the soil has reached a critical level of dryness, the third element of the CPS comes into play: the actuators. They are the ones that open the valves, activate the water pump or regulate the flow rate, and they do so for the exact time necessary to return the humidity to optimal levels. If conditions change (it starts raining, the temperature drops, or the soil recovers moisture faster than expected), the system adjusts its behavior accordingly.

This whole process happens without human intervention, autonomously. The result is a more sustainable use of water, better cared for plants and a real-time adaptability that is only possible thanks to the integration of sensors, algorithms and actuators characteristic of cyber-physical systems.
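As a toy illustration of this perceive-decide-act cycle, the following Python sketch simulates a soil moisture sensor, a threshold-based decision and a valve actuator. Every value and component here is simulated for illustration; a real controller would read physical sensors and drive real hardware.

```python
# Toy sketch of the perceive-decide-act loop of a smart irrigation CPS.
# The sensor, threshold and actuator are all simulated for illustration only.
import random
import time

MOISTURE_THRESHOLD = 30.0   # % below which irrigation starts (assumed value)


def read_soil_moisture() -> float:
    """Perception: in a real CPS this would read a physical soil moisture sensor."""
    return random.uniform(10, 60)


def set_valve(open_valve: bool) -> None:
    """Action: in a real CPS this would drive the irrigation valve or pump."""
    print("valve OPEN" if open_valve else "valve CLOSED")


def control_loop(cycles: int = 5) -> None:
    for _ in range(cycles):
        moisture = read_soil_moisture()               # perception
        needs_water = moisture < MOISTURE_THRESHOLD   # decision
        set_valve(needs_water)                        # action
        print(f"soil moisture = {moisture:.1f}%")
        time.sleep(1)


if __name__ == "__main__":
    control_loop()
```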

CPS as real-time data factories

One of the most relevant characteristics of cyber-physical systems is their ability to generate data continuously, massively and with a very high temporal resolution. This constant production can be seen in many day-to-day situations:

  • A hydrological station can record level and flow every minute.

  • An urban mobility sensor can generate hundreds of readings per second.

  • A smart meter records electricity consumption every few minutes.

  • An agricultural sensor measures humidity, salinity, and solar radiation several times a day.

  • A mapping drone captures decimetric GPS positions in real time.

Beyond these specific examples, the important thing is to understand what this capability means for the system as a whole: CPS become true data factories, and in many cases come to function as digital twins of the physical environment they monitor. This almost instantaneous equivalence between the real state of a river, a crop, a road or an industrial machine and its digital representation allows us to have an extremely accurate and up-to-date portrait of the physical world, practically at the same time as the phenomena occur.

This wealth of data opens up a huge field of opportunity when published as open information. Data from CPS can drive innovative services developed by companies, fuel high-impact scientific research, empower citizen science initiatives that complement institutional data, and strengthen transparency and accountability in the management of public resources.

However, for all this value to really reach citizens and the reuse community, it is necessary to overcome a series of technical, organisational and quality challenges that determine the final usefulness of open data. Below, we look at what those challenges are and why they are so important in an ecosystem that is increasingly reliant on real-time generated information.

The challenge: from raw data to useful public information

Just because a CPS generates data does not mean that it can be published directly as open data. Before reaching the public and reuse companies, the information needs prior preparation, validation, filtering and documentation. Administrations must ensure that such data is understandable, interoperable and reliable. And along the way, several challenges appear.

One of the first is standardization. Each manufacturer, sensor and system can use different formats, different sample rates or its own structures. If these differences are not harmonized, what we obtain is a mosaic that is difficult to integrate. For data to be interoperable, common models, homogeneous units, coherent structures, and shared standards are needed. Regulations such as INSPIRE or the OGC (Open Geospatial Consortium) and IoT-TS standards are key so that data generated in one city can be understood, without additional transformation, in another administration or by any reuser.

The next big challenge is quality. Sensors can fail, freeze always reporting the same value, generate physically impossible readings, suffer electromagnetic interference or be poorly calibrated for weeks without anyone noticing. If this information is published as is, without a prior review and cleaning process, the open data loses value and can even lead to errors. Validation – with automatic checks and periodic review – is therefore indispensable.

Another critical point is contextualization. An isolated piece of information is meaningless. A "12.5" says nothing if we don't know if it's degrees, liters or decibels. A measurement of "125 ppm" is useless if we do not know what substance is being measured. Even something as seemingly objective as coordinates needs a specific frame of reference. And any environmental or physical data can only be properly interpreted if it is accompanied by the date, time, exact location and conditions of capture. This is all part of metadata, which is essential for third parties to be able to reuse information unambiguously.

It's also critical to address privacy and security. Some CPS can capture information that, directly or indirectly, could be linked to sensitive people, property, or infrastructure. Before publishing the data, it is necessary to apply anonymization processes, aggregation techniques, security controls and impact assessments that guarantee that the open data does not compromise rights or expose critical information.

Finally, there are operational challenges such as refresh rate and robustness of data flow. Although CPS generates information in real time, it is not always appropriate to publish it with the same granularity: sometimes it is necessary to aggregate it, validate temporal consistency or correct values before sharing it. Similarly, for data to be useful in technical analysis or in public services, it must arrive without prolonged interruptions or duplication, which requires a stable infrastructure and monitoring mechanisms.

Quality and traceability principles needed for reliable open data

Once these challenges have been overcome, the publication of data from cyber-physical systems must be based on a series of principles of quality and traceability. Without them, information loses value and, above all, loses trust.

The first is accuracy. The data must faithfully represent the phenomenon it measures. This requires properly calibrated sensors, regular checks, removal of clearly erroneous values, and checking that readings are within physically possible ranges. A sensor that reads 200°C at a weather station or a meter that records the same consumption for 48 hours are signs of a problem that needs to be detected before publication.
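By way of illustration, checks like the ones just described (out-of-range readings and "frozen" sensors) can be automated with a few lines of code. The sketch below uses pandas with assumed column names and thresholds; it is an example of the idea, not a reference implementation.

```python
# Hedged sketch of automatic validation checks with pandas: flag physically
# impossible temperature readings and "frozen" sensors that repeat the same
# value for many consecutive hours. Column names and thresholds are assumptions.
import pandas as pd


def validate(readings: pd.DataFrame,
             valid_range: tuple[float, float] = (-50.0, 60.0),
             max_repeats: int = 48) -> pd.DataFrame:
    out = readings.copy()
    # Out-of-range check: a weather station should never report 200 °C.
    out["out_of_range"] = ~out["temperature"].between(*valid_range)
    # Frozen-sensor check: the same value repeated max_repeats times in a row.
    same_as_prev = out["temperature"].eq(out["temperature"].shift())
    run_length = same_as_prev.groupby((~same_as_prev).cumsum()).cumsum()
    out["frozen"] = run_length >= max_repeats - 1
    return out


# Toy hourly series: 60 identical readings followed by one impossible value.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=100, freq="h"),
    "temperature": [5.0] * 60 + [200.0] + [6.0] * 39,
})
flagged = validate(readings)
print(flagged[flagged["out_of_range"] | flagged["frozen"]].head())
```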

The second principle is completeness. A dataset should indicate when there are missing values, time gaps, or periods when a sensor has been disconnected. Hiding these gaps can lead to wrong conclusions, especially in scientific analyses or in predictive models that depend on the continuity of the time series.

The third key element is traceability, i.e. the ability to reconstruct the history of the data. Knowing which sensor generated it, where it is installed, what transformations it has undergone, when it was captured or if it went through a cleaning process allows us to evaluate its quality and reliability. Without traceability, trust erodes and data loses value as evidence.

Proper updating is another fundamental principle. The frequency with which information is published must be adapted to the phenomenon measured. Air pollution levels may need updates every few minutes; urban traffic, every second; hydrology, every minute or every hour depending on the type of station; and meteorological data, with variable frequencies. Publishing too frequently can generate noise; publishing too slowly can render the data useless for certain uses.

The last principle is that of rich metadata. Metadata explains the data: what it measures, how it is measured, with what unit, how accurate the sensor is, what its operating range is, where it is located, what limitations the measurement has and what this information is generated for. They are not a footnote, but the piece that allows any reuser to understand the context and reliability of the dataset. With good documentation, reuse isn't just possible: it skyrockets.

Examples: CPS that reuses public data to be smarter

In addition to generating data, many cyber-physical systems also consume public data to improve their performance. This feedback makes open data a central resource for the functioning of smart territories. When a CPS integrates information from its own sensors with external open sources, its anticipation, efficiency, and accuracy capabilities are dramatically increased.

Precision agriculture: In agriculture, sensors installed in the field allow variables such as soil moisture, temperature or solar radiation to be measured. However, smart irrigation systems do not rely solely on this local information: they also incorporate weather forecasts from AEMET, open IGN maps  on slope or soil types, and climate models published as public data. By combining their own measurements with these external sources, agricultural CPS can determine much more accurately which areas of the land need water, when to plant, and how much moisture should be maintained in each crop. This fine management allows water and fertilizer savings that, in some cases, exceed 30%.

Water management: Something similar happens in water management. A cyber-physical system that controls a dam or irrigation canal needs to know not only what is happening at that moment, but also what may happen in the coming hours or days. For this reason, it integrates its own level sensors with open data on river gauging, rain and snow predictions, and even public information on ecological flows. With this expanded vision, the CPS can anticipate floods, optimize the release of the reservoir, respond better to extreme events or plan irrigation sustainably. In practice, the combination of proprietary and open data translates into safer and more efficient water management.

Impact: innovation, citizen science, and data-driven decisions

The union between cyber-physical systems and open data generates a multiplier effect that is manifested in different areas.

  • Business innovation: Companies have fertile ground to develop solutions based on reliable and real-time information. From open data and CPS measurements, smarter mobility applications, water management platforms, energy analysis tools, or predictive systems for agriculture can emerge. Access to public data lowers barriers to entry and allows services to be created without the need for expensive  private datasets, accelerating innovation and the emergence of new business models.

  • Citizen science: the combination of CPS and open data also strengthens social participation. Neighbourhood communities, associations or environmental groups can deploy low-cost sensors to complement public data and better understand what is happening in their environment. This gives rise to initiatives that measure noise in school zones, monitor pollution levels in specific neighbourhoods, follow the evolution of biodiversity or build collaborative maps that enrich official information.

  • Better public decision-making: finally, public managers benefit from this strengthened data ecosystem. The availability of reliable and up-to-date measurements makes it possible to design low-emission zones, plan urban transport more effectively, optimise irrigation networks, manage drought or flood situations or regulate energy policies based on real indicators. Without open data that complements and contextualizes the information generated by the CPS, these decisions would be less transparent and, above all, less defensible to the public.

In short, cyber-physical systems have become an essential piece for understanding and managing the world around us. Thanks to them, we can measure phenomena in real time, anticipate changes and act in a precise and automated way. But their true potential unfolds when their data are integrated into a quality open data ecosystem, capable of providing context, enriching decisions and multiplying uses.

The combination of CPS and open data allows us to move towards smarter territories, more efficient public services and more informed citizen participation. It provides economic value, drives innovation, facilitates research and improves decision-making in areas as diverse as mobility, water, energy or agriculture.

For all this to be possible, it is essential to guarantee the quality, traceability and standardization of the published data, as well as to protect privacy and ensure the robustness of information flows. When these foundations are well established, CPS not only measure the world: they help it improve, becoming a solid bridge between physical reality and shared knowledge.

Content created by Dr. Fernando Gualo, Professor at UCLM and Government and Data Quality Consultant. The content and views expressed in this publication are the sole responsibility of the author.

News

The European open data portal has published the third volume of its Use Case Observatory, a report that compiles the evolution of data reuse projects across Europe. This initiative highlights the progress made in four areas: economic, governmental, social and environmental impact.

The conclusion of three years of research

Between 2022 and 2025, the European Open Data Portal has systematically monitored the evolution of various European projects. The research began with an initial selection of 30 representative initiatives, which were analyzed in depth to identify their potential for impact.

After two years, 13 projects continued in the study, including three Spanish ones: Planttes, Tangible Data and UniversiDATA-Lab. Their development over time was studied to understand how the reuse of open data can generate real and sustainable benefits.

The publication of volume III in October 2025 marks the closure of this series of reports, following volume I (2022) and volume II (2024). This last document offers a longitudinal view, showing how the projects have matured in three years of observation and what concrete impacts they have generated in their respective contexts.

Common conclusions

This third and final report compiles a number of key findings:

Economic impact

Open data drives growth and efficiency across industries. They contribute to job creation, both directly and indirectly, facilitate smarter recruitment processes and stimulate innovation in areas such as urban planning and digital services.

The report highlights the following example:

  • Naar Jobs (Belgium): an application for finding jobs close to users' homes, taking into account the available transport options.

This application demonstrates how open data can become a driver for regional employment and business development.

Government impact

The opening of data strengthens transparency, accountability and citizen participation.

Two use cases analysed belong to this field:

Both examples show how access to public information empowers citizens, enriches the work of the media, and supports evidence-based policymaking. All of this helps to strengthen democratic processes and trust in institutions.

Social impact

Open data promotes inclusion, collaboration, and well-being.

The following initiatives analysed belong to this field:

  • UniversiDATA-Lab (Spain): university data repository that facilitates analytical applications.
  • VisImE-360 (Italy): a tool to map visual impairment and guide health resources.
  • Tangible Data (Spain): a company focused on making physical sculptures that turn data into accessible experiences.
  • EU Twinnings (Netherlands): platform that compares European regions to find "twin cities"
  • Open Food Facts (France): collaborative database on food products.
  • Integreat (Germany): application that centralizes public information to support the integration of migrants.

All of them show how data-driven solutions can amplify the voice of vulnerable groups, improve health outcomes and open up new educational opportunities. Even the smallest effects, such as improvement in a single person's life, can prove significant and long-lasting.

Environmental impact

Open data acts as a powerful enabler of sustainability.

In this area, too, we find a large number of use cases:

  • Digital Forest Dryads (Estonia): a project that uses data to monitor forests and promote their conservation.
  • Air Quality in Cyprus (Cyprus): platform that reports on air quality and supports environmental policies.
  • Planttes (Spain): citizen science app that helps people with pollen allergies by tracking plant phenology.
  • Environ-Mate (Ireland): a tool that promotes sustainable habits and ecological awareness.

These initiatives highlight how data reuse contributes to raising awareness, driving behavioural change and enabling targeted interventions to protect ecosystems and strengthen climate resilience.

Volume III also points to common challenges: the need for sustainable financing, the importance of combining institutional data with citizen-generated data, and the desirability of involving end-users throughout the project lifecycle. In addition, it underlines the importance of European collaboration and transnational interoperability to scale impact.

Overall, the report reinforces the relevance of continuing to invest in open data ecosystems as a key tool to address societal challenges and promote inclusive transformation.

The impact of Spanish projects on the reuse of open data

As we have mentioned, three of the use cases analysed in the Use Case Observatory have a Spanish stamp. These initiatives stand out for their ability to combine technological innovation with social and environmental impact, and highlight Spain's relevance within the European open data ecosystem. Their track record demonstrates how our country actively contributes to transforming data into solutions that improve people's lives and reinforce sustainability and inclusion. Below, we zoom in on what the report says about them.

Planttes

This citizen science initiative helps people with pollen allergies through real-time information about allergenic plants in bloom. Since its appearance in Volume I of the Use Case Observatory it has evolved as a participatory platform in which users contribute photos and phenological data to create a personalized risk map. This participatory model has made it possible to maintain a constant flow of information validated by researchers and to offer increasingly complete maps. With more than 1,000 initial downloads and about 65,000 annual visitors to its website, it is a useful tool for people with allergies, educators and researchers.

The project has strengthened its digital presence, with increasing visibility thanks to the support of institutions such as the Autonomous University of Barcelona and the University of Granada, in addition to the promotion carried out by the company Thigis.

Its challenges include expanding geographical coverage beyond Catalonia and Granada and sustaining data participation and validation. Therefore, looking to the future, it seeks to extend its territorial reach, strengthen collaboration with schools and communities, integrate more data in real time and improve its predictive capabilities.

Throughout this time, Planttes has established itself as an example of how citizen-driven science can improve public health and environmental awareness, demonstrating the value of citizen science in environmental education, allergy management, and climate change monitoring.

Tangible Data

The project transforms datasets into physical sculptures that represent global challenges such as climate change or poverty, integrating QR codes and NFC to contextualize the information. Recognized at the EU Open Data Days 2025, Tangible Data has inaugurated its installation Tangible climate at the National Museum of Natural Sciences in Madrid.

Tangible Data has evolved in three years from a prototype project based on 3D sculptures to visualize sustainability data to become an educational and cultural platform that connects open data with society. Volume III of the Use Case Observatory reflects its expansion into schools and museums, the creation of an educational program for 15-year-old students, and the development of interactive experiences with artificial intelligence, consolidating its commitment to accessibility and social impact.

Its challenges include funding and scaling up the education programme, while its future goals include scaling up school activities, displaying large-format sculptures in public spaces,  and strengthening collaboration with artists and museums. Overall, it remains true to its mission of making data tangible, inclusive, and actionable.

UniversiDATA-Lab

UniversiDATA-Lab is a dynamic repository of analytical applications based on open data from Spanish universities, created in 2020 as a public-private collaboration and currently made up of six institutions. Its unified infrastructure facilitates the publication and reuse of data in standardized formats, reducing barriers and allowing students, researchers, companies and citizens to access useful information for education, research and decision-making.

Over the past three years, the project has grown from a prototype to a consolidated platform, with active applications such as the budget and retirement viewer, and a hiring viewer in beta. In addition, it organizes a periodic datathon that promotes innovation and projects with social impact.

Its challenges include internal resistance at some universities and the complex anonymization of sensitive data, although it has responded with robust protocols and a focus on transparency. Looking to the future, it seeks to expand its catalogue, add new universities and launch applications on emerging issues such as school dropouts, teacher diversity or sustainability, aspiring to become a European benchmark in the reuse of open data in higher education.

Conclusion

In conclusion, the third volume of the Use Case Observatory confirms that open data has established itself as a key tool to boost innovation, transparency and sustainability in Europe. The projects analysed – and in particular the Spanish initiatives Planttes, Tangible Data and UniversiDATA-Lab – demonstrate that the reuse of public information can translate into concrete benefits for citizens, education, research and the environment.

Application

embalses.info is a web platform that provides up-to-date information on the status of Spain’s reservoirs and dams. The application offers real-time hydrological data with weekly updates, allowing citizens, researchers, and public managers to consult water levels, capacities, and historical trends for more than 400 reservoirs organized into 16 river basins.

The application includes an interactive dashboard showing the overall status of Spanish reservoirs, an interactive basin map with filling levels (coming soon), and detailed pages for each reservoir with weekly trend charts, comparisons with previous years, and historical records dating back to the 1980s. It features a powerful search engine, data analysis with interactive charts, and a contact form for suggestions.

From a technical standpoint, the platform uses Next.js 14+ with TypeScript on the frontend, Prisma ORM for data access, and PostgreSQL/SQL Server as the database. It is SEO-optimized with a dynamic XML sitemap, optimized meta tags, structured data, and friendly URLs. The site is fully responsive, accessible, and includes automatic light/dark mode.

The public value of the application lies in providing transparency and accessible information on Spain’s water resources, enabling farmers, public administrations, researchers, and the media to make informed decisions based on reliable and up-to-date data.

Blog

In any data management environment (companies, public administration, consortia, research projects), having data is not enough: if you don't know what data you have, where it is, what it means, who maintains it, with what quality, when it changed or how it relates to other data, then the value is very limited. Metadata —data about data—is essential for:

  • Visibility and access: allowing users to find out what data exists and how it can be accessed.

  • Contextualization: knowing what the data means (definitions, units, semantics).

  • Traceability/lineage: understanding where data comes from and how it has been transformed.

  • Governance and control: knowing who is responsible, what policies apply, permissions, versions and obsolescence.

  • Quality, integrity and consistency: ensuring data reliability through rules, metrics and monitoring.

  • Interoperability: ensuring that different systems or domains can share data, using a common vocabulary, shared definitions and explicit relationships.

In short, metadata is the lever that turns "siloed" data into a governed information ecosystem. As data grows in volume, diversity, and velocity, its function goes beyond simple description: metadata adds context, allows data to be interpreted, and makes  it findable, accessible, interoperable, and reusable (FAIR).

In the new context driven by artificial intelligence, this metadata layer becomes even more relevant, as it provides the provenance information needed to ensure traceability, reliability, and reproducibility of results. For this reason, some recent frameworks extend these principles to FAIR-R, where the additional "R" highlights the importance of data being AI-ready, i.e. that it meets a series of technical, structural and quality requirements that optimize its use by artificial intelligence algorithms.

Thus, we are talking about enriched metadata, capable of connecting technical, semantic and contextual information to enhance machine learning, interoperability between domains and the generation of verifiable knowledge.

From traditional metadata to "rich metadata"

Traditional metadata

In the context of this article, when we talk about metadata with a traditional use, we think of catalogs, dictionaries, glossaries, database data models, and rigid structures (tables and columns). The most common types of metadata are:

  • Technical metadata: column type, length, format, foreign keys, indexes, physical locations.

  • Business/semantic metadata: field name, description, value domain, business rules, business glossary terms.

  • Operational/execution metadata: refresh rate, last load, processing times, usage statistics.

  • Quality metadata: percentage of null values, duplicates, validations.

  • Security/access metadata: access policies, permissions, sensitivity rating.

  • Lineage metadata: tracing of transformations in data pipelines.

This metadata is usually stored in repositories or cataloguing tools, often with tabular structures or relational bases, with predefined links.

Why "rich metadata"?

Rich metadata is a layer that not only describes attributes, but also:

  • They discover and infer implicit relationships, identifying links that are not expressly defined in data schemas. This allows, for example, to recognize that two variables with different names in different systems actually represent the same concept ("altitude" and "elevation"), or that certain attributes maintain a hierarchical relationship ("municipality" belongs to "province").
  • They facilitate semantic queries and automated reasoning, allowing users and machines to explore relationships and patterns that are not explicitly defined in databases. Rather than simply looking for exact matches of names or structures, rich metadata allows you to ask questions based on meaning and context. For example, automatically identifying all datasets related to "coastal cities" even if the term does not appear verbatim in the metadata.
  • They adapt and evolve flexibly, as they can be extended with new entity types, relationships, or domains without the need to redesign the entire catalog structure. This allows new data sources, models or standards to be easily incorporated, ensuring the long-term sustainability of the system.
  • They incorporate automation into tasks that were previously manual or repetitive, such as duplication detection, automatic matching of equivalent concepts, or semantic enrichment using machine learning. They can also identify inconsistencies or anomalies, improving the quality and consistency of metadata.
  • They explicitly integrate the business context, linking each data asset to its operational meaning and its role within organizational processes. To do this, they use controlled vocabularies, ontologies or taxonomies that facilitate a common understanding between technical teams, analysts and business managers.
  • They promote deeper interoperability between heterogeneous domains, which goes beyond the syntactic exchange facilitated by traditional metadata. Rich metadata adds a semantic layer that allows you to understand and relate data based on its meaning, not just its format. Thus, data from different sources or sectors – for example, Geographic Information Systems (GIS), Building Information Modeling (BIM) or the Internet of Things (IoT) – can be linked in a coherent way within a shared conceptual framework. This semantic interoperability is what makes it possible to integrate knowledge and reuse information between different technical and organizational contexts.

This turns metadata into a living asset, enriched and connected to domain knowledge, not just a passive "record".

The Evolution of Metadata: Ontologies and Knowledge Graphs

The incorporation of ontologies and knowledge graphs represents a conceptual evolution in the way metadata is described, related and used, hence we speak of enriched metadata. These tools not only document the data, but connect them within a network of meaning, allowing the relationships between entities, concepts, and contexts to be explicit and computable.

In the current context, marked by the rise of artificial intelligence, this semantic structure takes on a fundamental role:  it provides algorithms with the contextual knowledge necessary to interpret, learn and reason about data in a more accurate and transparent way. Ontologies and graphs allow AI systems not only to process information, but also  to understand the relationships between elements and to generate grounded inferences, opening the way to more explanatory and reliable models.

This paradigm shift transforms metadata into a dynamic structure, capable of reflecting the complexity of knowledge and facilitating semantic interoperability between different domains and sources of information. To understand this evolution, it is necessary to define and relate some concepts:


Ontologies

In the world of data, an ontology is a highly organized conceptual map that clearly defines:

  • What entities exist (e.g., city, river, road).
  • What properties they have (e.g. a city has a name, town, zip code).
  • How they relate to each other (e.g. a river runs through a city, a road connects two municipalities).

The goal is for people and machines to share the same vocabulary and understand data in the same way. Ontologies allow:

  • Define concepts and relationships: for example, "a plot belongs to a municipality", "a building has geographical coordinates".
  • Set rules and restrictions: such as "each building must be exactly on a cadastral plot".
  • Unify vocabularies: if in one system you say "plot" and in another "cadastral unit", ontology helps to recognize that they are analogous.
  • Make inferences: from simple data, discover new knowledge (if a building is on a plot and the plot is in Seville, it can be inferred that the building is in Seville).
  • Establish a common language: they work as a dictionary shared between different systems or domains (GIS, BIM, IoT, cadastre, urban planning).

In short: an ontology is the dictionary and the rules of the game that allow different geospatial systems (maps, cadastre, sensors, BIM, etc.) to understand each other and work in an integrated way.

Knowledge Graphs

A knowledge graph is a way of organizing information as if it were a network of concepts connected to each other.

  • Nodes represent things or entities, such as a city, a river, or a building.

  • The edges (lines) show the relationships between them, for example: "is in", "crosses" or "belongs to".

  • Unlike a simple drawing of connections, a knowledge graph also explains the meaning of those relationships: it adds semantics.

A knowledge graph combines three main elements:

  1. Data: specific cases or instances, such as "Seville", "Guadalquivir River" or "Seville City Hall Building".

  2. Semantics (or ontology): the rules and vocabularies that define what kinds of things exist (cities, rivers, buildings) and how they can relate to each other.

  3. Reasoning: the ability to discover new connections from existing ones (for example, if a river crosses a city and that city is in Spain, the system can deduce that the river is in Spain).

In addition, knowledge graphs make it possible to connect information from different fields (e.g. data on people, places and companies) under the same common language, facilitating analysis and interoperability between disciplines.

In other words, a knowledge graph is the result of applying an ontology (the data model) to several individual datasets (spatial elements, other territory data, patient records or catalog products, etc.). Knowledge graphs are ideal for integrating heterogeneous data, because they do not require a previously complete rigid schema: they can be grown flexibly. In addition, they allow semantic queries and navigation with complex relationships. Here's an example for spatial data to understand the differences:

Spatial data ontology (conceptual model):

  • Classes: River, Ocean, Building, Road, City.
  • Relationships:
    • River → flows into → Ocean
    • City → contains → Building
    • Road → connects → City

Knowledge graph (specific examples with instances):

  • Specific nodes: ‘Guadalquivir River’, ‘Atlantic Ocean’, ‘Seville City Hall Building’, ‘A-4 Motorway’, ‘City of Seville’, ‘City of Cadiz’.
  • Relationships:
    • Guadalquivir River → flows into → Atlantic Ocean
    • City of Seville → contains → Seville City Hall Building
    • A-4 motorway → connects → City of Seville and City of Cadiz
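As a minimal sketch, the same toy ontology and instances could be encoded as a machine-readable graph, for example with the rdflib library in Python. The namespace and the class/property names below are illustrative assumptions, not an official vocabulary.

```python
# Minimal sketch, using rdflib, of how the toy ontology and instances above can
# be encoded as a knowledge graph. The namespace and the class/property names
# are illustrative assumptions, not an official vocabulary.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("https://example.org/geo#")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Ontology level: the classes and a relationship (the "rules of the game").
for cls in ("River", "Ocean", "Building", "Road", "City"):
    g.add((EX[cls], RDF.type, RDFS.Class))
g.add((EX.flowsInto, RDFS.domain, EX.River))
g.add((EX.flowsInto, RDFS.range, EX.Ocean))

# Instance level: concrete nodes and the edges that connect them.
g.add((EX.GuadalquivirRiver, RDF.type, EX.River))
g.add((EX.AtlanticOcean, RDF.type, EX.Ocean))
g.add((EX.Seville, RDF.type, EX.City))
g.add((EX.SevilleCityHall, RDF.type, EX.Building))
g.add((EX.GuadalquivirRiver, EX.flowsInto, EX.AtlanticOcean))
g.add((EX.Seville, EX.contains, EX.SevilleCityHall))

# A semantic query over the graph: which rivers flow into which oceans?
results = g.query("""
    PREFIX ex: <https://example.org/geo#>
    SELECT ?river ?ocean WHERE { ?river ex:flowsInto ?ocean . }
""")
for river, ocean in results:
    print(river, "->", ocean)
```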

Use Cases

To better understand the value of smart metadata and semantic catalogs, there is nothing better than looking at examples where they are already being applied. These cases show how the combination of ontologies and knowledge graphs makes it possible to connect dispersed information, improve interoperability and generate actionable knowledge in different contexts.

From emergency management to urban planning or environmental protection, different international projects have shown that semantics is not just theory, but a practical tool that transforms data into decisions.

Some relevant examples include:

  • LinkedGeoData, which converted OpenStreetMap data into Linked Data, linking it to other open sources.
  • Virtual Singapore, a 3D digital twin that integrates geospatial, urban and real-time data for simulation and planning.
  • JedAI-spatial, a tool for interconnecting 3D spatial data using semantic relationships.
  • SOSA Ontology, a standard widely used in sensor and IoT projects for environmental observations with a geospatial component.
  • European projects on digital building permits (e.g. ACCORD), which combine semantic catalogs, BIM models and GIS data to automatically validate building regulations.

Conclusions

The evolution towards rich metadata, supported by ontologies, knowledge graphs and FAIR-R principles, represents a substantial change in the way data is managed, connected and understood. This new approach makes metadata an active component of the digital infrastructure, capable of providing context, traceability and meaning, and not just describing information.

Rich metadata allows you to learn from data, improve semantic interoperability between domains, and facilitate more expressive queries, where relationships and dependencies can be discovered in an automated way. In this way, they favor the integration of dispersed information and support both informed decision-making and the development of more explanatory and reliable artificial intelligence models.

In the field of open data, these advances drive the transition from descriptive repositories to ecosystems of interconnected knowledge, where data can be combined and reused in a flexible and verifiable way. The incorporation of semantic context and provenance reinforces transparency, quality and responsible reuse.

This transformation requires, however, a progressive and well-governed approach: it is essential to plan for systems migration, ensure semantic quality, and promote the participation of multidisciplinary communities.

In short, rich metadata is the basis for moving from isolated data to connected and traceable knowledge, a key element for interoperability, sustainability and trust in the data economy.

Content prepared by Mayte Toscano, Senior Consultant in Data Economy Technologies. The contents and points of view reflected in this publication are the sole responsibility of the author.

Blog

Open data has great potential to transform the way we interact with our cities. As they are available to all citizens, they allow the development of applications and tools that respond to urban challenges such as accessibility, road safety or citizen participation. Facilitating access to this information not only drives innovation, but also contributes to improving the quality of life in urban environments.

This potential becomes even more relevant if we consider the current context. Accelerated urban growth has brought with it new challenges, especially in the area of public health. According to data from the United Nations, it is estimated that by 2050 more than 68% of the world's population will live in cities. Therefore, the design of healthy urban environments is a priority in which open data is consolidated as a key tool: it allows  planning more resilient, inclusive and sustainable cities, putting people's well-being at the center of decisions. In this post, we tell you what healthy urban environments are and how open data can help build and maintain them.

What are Healthy Urban Environments? Uses and examples

Healthy urban environments go beyond simply the absence of pollution or noise. According to the World Health Organization (WHO), these spaces must actively promote healthy lifestyles, facilitate physical activity, encourage social interaction, and ensure equitable access to basic services. As established in the Ministry of Health's "Guide to Planning Healthy Cities", these environments are characterized by three key elements:

  • Cities designed for walking: they must be spaces that prioritize pedestrian and cycling mobility, with safe, accessible and comfortable streets that invite active movement.

  • Incorporation of nature: they integrate green areas, blue infrastructure and natural elements that improve air quality, regulate urban temperature and offer spaces for recreation and rest.

  • Meeting and coexistence spaces: they have areas that facilitate social interaction, reduce isolation and strengthen the community fabric.

The role of open data in healthy urban environments

In this scenario, open data acts as the nervous system of smart cities, providing valuable information on usage patterns, citizen needs, and public policy effectiveness. Specifically, in the field of healthy urban spaces, open data enables:

  • Analysis of physical activity patterns: data on mobility, use of sports facilities and frequentation of green spaces reveal where and when citizens are most active, identifying opportunities to optimize existing infrastructure.

  • Environmental quality monitoring: urban sensors that measure air quality, noise levels, and temperature provide real-time information on the health conditions of different urban areas.

  • Accessibility assessment: data on public transport, pedestrian infrastructure and the distribution of services make it possible to identify access barriers and design more inclusive solutions.

  • Informed citizen participation: open data platforms facilitate participatory processes where citizens can contribute local information and collaborate in decision-making.

The Spanish open data ecosystem has solid platforms that feed healthy urban space projects. For example, the Madrid City Council's Open Data Portal offers real-time information on air quality as well as a complete inventory of green areas. Barcelona also publishes data on air quality, including the locations and characteristics of measuring stations.

These portals not only store information, but structure it in a way that developers, researchers and citizens can create innovative applications and services.

Use Cases: Applications That Reuse Open Data

Several projects demonstrate how open data translates into tangible improvements for urban health. On the one hand, we can highlight some applications or digital tools such as:

  • AQI Air Quality Index: uses government data to provide real-time information on air quality in different Spanish cities.

  • GV Aire: processes official air quality data to generate citizen alerts and recommendations.

  • National Air Quality Index: centralizes information from measurement stations throughout the country.

  • Valencia Verde: uses municipal data to show the location and characteristics of parks and gardens in Valencia.

On the other hand, there are initiatives that combine multisectoral open data to offer solutions that improve the interaction between cities and citizens. For example:

  • Supermanzanas Program: uses maps showing air pollution levels and traffic data, available in open formats such as CSV and GeoPackage from Barcelona Open Data and Barcelona City Council, to identify streets where reducing road traffic can maximize health benefits, creating safe spaces for pedestrians and cyclists.

  • The DataActive platform: seeks to establish an international infrastructure in which researchers, public and private sports entities participate. The topics it addresses include land management, urban planning, sustainability, mobility, air quality and environmental justice. It aims to promote more active, healthy and accessible urban environments through the implementation of strategies based on open data and research.

Data availability is complemented by advanced visualization tools. The Madrid Spatial Data Infrastructure (IDEM) offers geographic viewers specialized in air quality, and the National Geographic Institute (IGN) provides CartoCiudad, a national street map with information on all cities in Spain.
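
As a small illustration of working with the open geospatial formats mentioned above (CSV, GeoPackage), the following sketch uses geopandas; the file name, layer and district attribute are assumptions for the example, not references to a specific published dataset.

```python
# Minimal sketch: measuring green space per district from an open GeoPackage.
# Assumptions (hypothetical): "green_areas.gpkg" has a polygon layer "green_areas"
# with a "district" attribute; adapt names to the actual dataset.
import geopandas as gpd

gdf = gpd.read_file("green_areas.gpkg", layer="green_areas")

# Reproject to a metric CRS (ETRS89 / UTM 30N, common in Spain) to measure areas in m².
gdf = gdf.to_crs(epsg=25830)
gdf["area_m2"] = gdf.geometry.area

# Square metres of green space per district, largest first.
per_district = gdf.groupby("district")["area_m2"].sum().sort_values(ascending=False)
print(per_district.head())
```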

Effective governance and innovation ecosystem

However, the effectiveness of these initiatives depends on new governance models that integrate multiple actors. To achieve proper coordination between public administrations at different levels, private companies, third sector organizations and citizens, it is essential to have quality open data.

Open data not only powers specific applications but creates an entire ecosystem of innovation. Independent developers, startups, research centers, and citizen organizations use this data to:

  • Develop urban health impact studies.

  • Create participatory planning tools.

  • Generate early warnings about environmental risks.

  • Evaluate the effectiveness of public policies.

  • Design personalized services according to the needs of different population groups.

Healthy urban spaces projects based on open data generate multiple tangible benefits:

  • Efficiency in public management: data makes it possible to optimize the allocation of resources, prioritize interventions and evaluate their real impact on citizen health.

  • Innovation and economic development: the open data ecosystem stimulates the creation of innovative startups and services that improve the quality of urban life, as demonstrated by the multiple applications available in datos.gob.es.

  • Transparency and participation: the availability of data facilitates citizen control and strengthens democratic decision-making processes.

  • Scientific evidence: Urban health data contributes to the development of evidence-based public policies and the advancement of scientific knowledge.

  • Replicability: successful solutions can be adapted and replicated in other cities, accelerating the transformation towards healthier urban environments.

In short, the future of our cities depends on our ability to integrate technology, citizen participation and innovative public policies. The examples analyzed show that open data is not just information: it is the foundation for building urban environments that actively promote health, equity and sustainability.

Blog

Open data from public sources has evolved over the years, from being simple repositories of information to constituting dynamic ecosystems that can transform public governance. In this context, artificial intelligence (AI) emerges as a catalytic technology that benefits from the value of open data and exponentially enhances its usefulness. In this post we will see what the mutually beneficial symbiotic relationship between AI and open data looks like.

Traditionally, the debate on open data has focused on portals: the platforms on which governments publish information so that citizens, companies and organizations can access it. But the so-called "Third Wave of Open Data," a term by New York University's GovLab, emphasizes that it is no longer enough to publish datasets on demand or by default. The important thing is to think about the entire ecosystem: the life cycle of data, its exploitation, maintenance and, above all, the value it generates in society.

What role can open data play in AI?

In this context, AI appears as a catalyst capable of automating tasks, enriching open government data (OGD), facilitating its understanding and stimulating collaboration between actors.

Recent research, developed by European universities, maps how this silent revolution is happening. The study proposes a classification of uses according to two dimensions:

  1. Perspective, which in turn is divided into two possible paths:

    1. Inward-looking (portal): The focus is on the internal functions of data portals.

    2. Outward-looking (ecosystem): the focus is extended to interactions with external actors (citizens, companies, organizations).

  2. Phases of the data life cycle, which can be divided into pre-processing, exploration, transformation and maintenance.

In summary, the report identifies these eight types of AI use in government open data, based on perspective and phase in the data lifecycle.

Figure description: a matrix mapping eight AI roles onto the Open Government Data (OGD) lifecycle phases (pre-processing, exploration, transformation, maintenance) from two perspectives: the inward-looking portal perspective (Portal Curator, Portal Explorer, Portal Linker, Portal Monitor) and the outward-looking ecosystem perspective (Ecosystem Data Retriever, Ecosystem Connector, Ecosystem Value Developer, Ecosystem Engager).

Figure 1. Eight uses of AI to improve government open data. Source: presentation “Data for AI or AI for data: artificial intelligence as a catalyst for open government ecosystems”, based on the report of the same name, from EU Open Data Days 2025.

Each of these uses is described in more detail below:

1. Portal curator

This application focuses on pre-processing data within the portal. AI helps organize, clean, anonymize, and tag datasets before publication. Some examples of tasks are:

  • Automation and improvement of data publication tasks.

  • Performing auto-tagging and categorization functions.

  • Data anonymization to protect privacy.

  • Automatic cleaning and filtering of datasets.

  • Feature extraction and missing data handling.
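
As a toy illustration of the auto-tagging idea, here is a minimal sketch based on a hand-made keyword map; real portals would typically combine rules like these with trained classifiers, and the keywords and descriptions below are purely illustrative.

```python
# Minimal sketch: keyword-based auto-tagging of dataset descriptions.
# The keyword-to-tag map is illustrative, not a real portal taxonomy.
KEYWORD_TAGS = {
    "air quality": "environment",
    "bus": "transport",
    "budget": "economy",
    "school": "education",
}

def auto_tag(description: str) -> list[str]:
    """Return the tags whose keywords appear in the description."""
    text = description.lower()
    return sorted({tag for keyword, tag in KEYWORD_TAGS.items() if keyword in text})

for description in [
    "Hourly air quality measurements by monitoring station",
    "Municipal budget execution by programme and school expenditure",
]:
    print(description, "->", auto_tag(description))
```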

2. Ecosystem data retriever

Also in the pre-processing phase, but with an external focus, AI expands the coverage of portals by identifying and collecting information from diverse sources. Some tasks are:

  • Retrieve structured data from legal or regulatory texts.

  • News mining to enrich datasets with contextual information.

  • Integration of urban data from sensors or digital records.

  • Discovery and linking of heterogeneous sources.

  • Conversion of complex documents into structured information.
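
As a toy example of retrieving structured data from a legal text, the following sketch extracts law identifiers and article numbers with regular expressions; production extractors are far more sophisticated, and the patterns and text fragment here are illustrative.

```python
# Minimal sketch: pulling structured references out of an unstructured legal fragment.
import re

text = (
    "Article 12 of Law 19/2013, of 9 December, on transparency, access to public "
    "information and good governance establishes the right of access."
)

laws = re.findall(r"\b\d{1,3}/\d{4}\b", text)    # identifiers like 19/2013
articles = re.findall(r"Article\s+(\d+)", text)  # article numbers

print({"laws": laws, "articles": articles})
```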

3. Portal explorer

In the exploration phase, AI systems can also make it easier to find and interact with published data, with a more internal approach. Some use cases:

  • Develop semantic search engines to locate datasets.

  • Implement chatbots that guide users in data exploration.
  • Provide natural language interfaces for direct queries.

  • Optimize the portal's internal search engines.

  • Use language models to improve information retrieval.
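
A minimal sketch of a dataset search helper follows, using TF-IDF similarity from scikit-learn as a lightweight stand-in for the language models mentioned above; the catalogue entries and query are illustrative.

```python
# Minimal sketch: ranking catalogue entries against a free-text query with TF-IDF.
# Note: TF-IDF only matches shared vocabulary; a semantic search would use embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalogue = [
    "Hourly air quality measurements by monitoring station",
    "Municipal budget execution by programme and year",
    "Inventory of parks and green areas with surface and location",
]
query = "air quality data by station"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(catalogue + [query])

# Similarity of the query (last row) against each catalogue entry.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for score, title in sorted(zip(scores, catalogue), reverse=True):
    print(f"{score:.2f}  {title}")
```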

4. Ecosystem connector

Operating also in the exploration phase, AI acts as a bridge between actors and ecosystem resources. Some examples are:

  • Recommend relevant datasets to researchers or companies.

  • Identify potential partners based on common interests.

  • Extract emerging themes to support policymaking.

  • Visualize data from multiple sources in interactive dashboards.

  • Personalize data suggestions based on social media activity.

5. Portal linker

This role focuses on the transformation of data within the portal, facilitating the combination and presentation of information for different audiences. Some tasks are:

  • Convert data into knowledge graphs (structures that connect related information, known as Linked Open Data).

  • Summarize and simplify data with NLP (Natural Language Processing) techniques.

  • Apply automatic reasoning to generate derived information.

  • Enhance multivariate visualization of complex datasets.

  • Integrate diverse data into accessible information products.
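
A minimal sketch of the Linked Open Data idea with the rdflib library follows; the namespace, properties and station identifier are illustrative, not an official vocabulary.

```python
# Minimal sketch: expressing one tabular record as RDF triples with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("https://example.org/air-quality/")  # illustrative namespace

g = Graph()
station = URIRef(EX + "station/28079004")  # hypothetical station identifier

g.add((station, RDF.type, EX.MonitoringStation))
g.add((station, EX.name, Literal("Plaza de España")))
g.add((station, EX.no2, Literal(42.0, datatype=XSD.float)))

print(g.serialize(format="turtle"))
```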

6. Ecosystem value developer

In the transformation phase and with an external perspective, AI generates products and services based on open data that provide added value. Some tasks are:

  • Suggest appropriate analytical techniques based on the type of dataset.

  • Assist in the coding and processing of information.

  • Create dashboards based on predictive analytics.

  • Ensure the correctness and consistency of the transformed data.

  • Support the development of innovative digital services.
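
As a toy illustration of suggesting analytical techniques from a dataset's structure, here is a minimal sketch; the heuristics are deliberately simple and the example data is made up.

```python
# Minimal sketch: recommending an analysis based on the column types of a dataset.
import pandas as pd

def suggest_analysis(df: pd.DataFrame) -> str:
    numeric = df.select_dtypes("number").columns
    temporal = df.select_dtypes("datetime").columns
    if len(temporal) > 0 and len(numeric) > 0:
        return "time series analysis or forecasting"
    if len(numeric) >= 2:
        return "correlation analysis or regression"
    return "frequency tables and categorical summaries"

example = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=5),
    "no2": [30.0, 32.0, 31.0, 29.0, 35.0],
})
print(suggest_analysis(example))  # -> time series analysis or forecasting
```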

7. Portal monitor

It operates in the maintenance phase, with an internal focus. Its role is to ensure quality, consistency and compliance with standards. Some tasks are:

  • Detect anomalies and outliers in published datasets.

  • Evaluate the consistency of metadata and schemas.

  • Automate data updating and purification processes.

  • Identify incidents in real time for correction.

  • Reduce maintenance costs through intelligent monitoring.
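
A minimal sketch of one such monitoring check follows, flagging outliers in a numeric column with an interquartile-range rule; the data is made up for the example.

```python
# Minimal sketch: flagging suspicious values in a published dataset with an IQR rule.
import pandas as pd

df = pd.DataFrame({
    "station": ["A", "B", "C", "D", "E", "F"],
    "no2": [31.0, 28.5, 35.2, 29.8, 310.0, 33.1],  # 310.0 looks like a data error
})

q1, q3 = df["no2"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

anomalies = df[(df["no2"] < lower) | (df["no2"] > upper)]
print(anomalies)  # the 310.0 reading is flagged for review
```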

8. Ecosystem engager

And finally, this function operates in the maintenance phase, but outwardly. It seeks to promote citizen participation and continuous interaction. Some tasks are:

  • Predict usage patterns and anticipate user needs.

  • Provide personalized feedback on datasets.

  • Facilitate citizen auditing of data quality.

  • Encourage participation in open data communities.

  • Identify user profiles to design more inclusive experiences.

What does the evidence tell us?

The study is based on a review of more than 70 academic papers examining the intersection between AI and OGD (open government data). From these cases, the authors observe that:

  • Some of the defined profiles, such as portal curator, portal explorer and portal monitor, are relatively mature and have multiple examples in the literature.

  • Others, such as ecosystem value developer and ecosystem engager, are less explored, although they have the most potential to generate social and economic impact.

  • Most applications today focus on automating specific tasks, but there is a lot of scope to design more comprehensive architectures, combining several types of AI in the same portal or across the entire data lifecycle.

From an academic point of view, this typology provides a common language and conceptual structure to study the relationship between AI and open data. It allows identifying gaps in research and guiding future work towards a more systemic approach.

In practice, the framework is useful for:

  • Data portal managers: helps them identify what types of AI they can implement according to their needs, from improving the quality of datasets to facilitating interaction with users.

  • Policymakers: guides them on how to design AI adoption strategies in open data initiatives, balancing efficiency, transparency, and participation.

  • Researchers and developers: it offers them a map of opportunities to create innovative tools that address specific ecosystem needs.

Limitations and next steps of the synergy between AI and open data

In addition to the advantages, the study recognizes some pending issues that, in a way, serve as a roadmap for the future. To begin with, several of the identified applications are still at an early or conceptual stage. And, perhaps most relevantly, the debate on the risks and ethical dilemmas of using AI on open data has not yet been addressed in depth: bias, privacy and technological sustainability remain open questions.

In short, the combination of AI and open data is still a field under construction, but with enormous potential. The key will be to move from isolated experiments to comprehensive strategies, capable of generating social, economic and democratic value. AI, in this sense, does not work independently of open data: it multiplies it and makes it more relevant for governments, citizens and society in general.

Blog

In a world where immediacy is becoming increasingly important, predictive commerce has become a key tool for anticipating consumer behaviors, optimizing decisions and offering personalized experiences. It is no longer just about reacting to the customer's needs, but about predicting what they want before they even know it.

In this article we are going to explain what predictive commerce is and the importance of open data in it, including real examples.

What is predictive commerce?

Predictive commerce is a strategy based on data analysis to anticipate consumers' purchasing decisions. It uses artificial intelligence algorithms and statistical models to identify patterns of behavior, preferences and key moments in the consumption cycle. Thanks to this, companies can learn which products will be most in demand, when and where a purchase will be made, or which customers are most likely to choose a certain brand.

This is of great importance in a market like the current one, where there is a saturation of products and competition. Predictive commerce allows companies to adjust inventories, prices, marketing campaigns or logistics in real time, becoming a great competitive advantage.

The role of open data in predictive commerce

These models are fed by large volumes of data: purchase history, web browsing, location or comments on social networks, among others. But the more accurate and diverse the data, the more fine-tuned the predictions will be. This is where open data plays a fundamental role, as it allows new variables to be taken into account when defining consumer behavior. Among other things, open data can help us:

  • Enrich prediction models with external information such as demographic data, urban mobility or economic indicators.
  • Detect regional patterns that influence consumption, such as the impact of climate on the sale of certain seasonal products.
  • Design more inclusive strategies by incorporating public data on the habits and needs of different social groups.

The following examples of datasets available in datos.gob.es can be used for these tasks at a national level; many autonomous communities and city councils also publish similar data of local interest.

  • Municipal register by age and sex (National Institute of Statistics, INE): segments populations by territory, age and gender. Useful for customizing campaigns based on the majority population of each municipality or for forecasting demand by demographic profile.
  • Household Budget Survey (INE): offers information on average household expenditure across categories, helping to anticipate consumption patterns by socioeconomic level.
  • Consumer Price Index, CPI (INE): disaggregates the CPI by territory, measuring how the prices of goods and services vary in each Spanish province. Useful for adjusting prices and market penetration strategies.
  • Real-time weather warnings (Ministry for the Ecological Transition and the Demographic Challenge): alerts of adverse weather phenomena, allowing weather to be correlated with product sales (clothing, beverages, heating, etc.).
  • Education and digital literacy statistics (INE): provides information on internet use in the last three months, allowing digital gaps to be identified and communication or training strategies to be adapted.
  • Data on tourist stays (INE): reports the average stay of tourists by autonomous community, helping to anticipate demand in areas with a high seasonal influx, for example for local products or tourist services.
  • Number of prescriptions and pharmaceutical expenditure (General Mutual Society for State Civil Servants, MUFACE): offers information on the consumption of medicines by province and age subgroup, making it easier to estimate sales of related medical and parapharmacy products.
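
To illustrate how open sources like these could enrich a demand model, here is a minimal sketch assuming two hypothetical files: an internal sales extract and a daily temperature series taken from an open weather dataset; the file and column names are assumptions for the example.

```python
# Minimal sketch: enriching a demand forecast with an open weather variable.
# Assumptions (hypothetical): "sales.csv" has columns date, units_sold;
# "temperature.csv" has columns date, avg_temp.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.read_csv("sales.csv", parse_dates=["date"])
weather = pd.read_csv("temperature.csv", parse_dates=["date"])

# Join internal sales with the open weather variable on the date.
df = sales.merge(weather, on="date", how="inner")

model = LinearRegression()
model.fit(df[["avg_temp"]], df["units_sold"])
print(f"Estimated extra units sold per +1 °C: {model.coef_[0]:.2f}")

# Illustrative prediction for a forecast temperature of 30 °C.
forecast = pd.DataFrame({"avg_temp": [30.0]})
print(f"Predicted demand at 30 °C: {model.predict(forecast)[0]:.0f} units")
```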

Real-world use cases

For years, we have already found companies that are using this type of data to optimize their business strategies. Let's look at some examples:

  • Using weather data to optimize stock in large supermarkets

 Walmart uses AI algorithms that incorporate weather data (such as heat waves, storms or temperature changes) along with historical sales data, events and digital trends to forecast demand at a granular level and optimize inventories. This allows the replenishment of critical products to be adjusted automatically according to anticipated weather patterns. In addition, Walmart notes that its system considers "future data" such as macro-level weather patterns, economic trends and local demographics to anticipate demand and potential supply chain disruptions.

 Tesco also uses public weather data in its predictive models. This allows it to anticipate buying patterns, such as the fact that for every 10°C increase in temperature, barbecue sales increase by up to 300%. In addition, Tesco receives local weather forecasts up to three times a day, connecting them with data on 18 million products and the type of customers in each store. This information is shared with its suppliers to adjust shipments and improve logistics efficiency.

  • Using demographic data to decide the location of premises

For years, Starbucks has turned to predictive analytics to plan its expansion. The company uses geospatial intelligence platforms, developed with GIS technology, to combine multiple sources of information – including open demographic and socioeconomic data such as population density, income level, mobility patterns, public transport or the type of nearby businesses – along with its own sales history. Thanks to this integration, it can predict which locations have the greatest potential for success, avoiding competition between its own stores and ensuring that each new store is located in the most suitable environment.

Domino's Pizza also used similar models to analyse whether opening a new location in one London neighbourhood would be successful and how it would affect other nearby locations, considering buying patterns and local demographics.

This approach makes it possible to predict customer flows and maximize profitability through more informed location decisions.

  • Socioeconomic data for pricing based on demographics

An interesting example can be found in SDG Group, an international consulting firm specialising in advanced analytics for retail. The company has developed solutions that allow prices and promotions to be adjusted taking into account the demographic and socioeconomic characteristics of each area – such as the consumer base, location or the size of the point of sale. Thanks to these models, it is possible to estimate the elasticity of demand and design dynamic pricing strategies adapted to the real context of each area, optimizing both profitability and the shopping experience.

The future of predictive commerce

The rise of predictive commerce has been fueled by the advancement of artificial intelligence and the availability of data, both open and private. From choosing the ideal place to open a store to efficiently managing inventory, public data combined with advanced analytics allows you to anticipate consumer behaviors and needs with increasing accuracy.

However, there are still important challenges to be faced: the heterogeneity of data sources, which in many cases lack common standards; the need for robust technologies and infrastructures that allow open information to be integrated with companies' internal systems; and, finally, the challenge of ensuring ethical and transparent use, which respects people's privacy and avoids the generation of bias in models.

Overcoming these challenges will be key for predictive commerce to unfold its full potential and become a strategic tool for companies of all sizes. On this path, open data will play a fundamental role as a driver of innovation, transparency and competitiveness in the trade of the future.

Blog

Artificial intelligence (AI) assistants are already part of our daily lives: we ask them the time, how to get to a certain place or we ask them to play our favorite song. And although AI, in the future, may offer us infinite functionalities, we must not forget that linguistic diversity is still a pending issue.

In Spain, where Spanish coexists with co-official languages such as Basque, Catalan, Valencian and Galician, this issue is especially relevant. The survival and vitality of these languages in the digital age depends, to a large extent, on their ability to adapt and be present in emerging technologies. Currently, most virtual assistants, automatic translators or voice recognition systems do not understand all the co-official languages. However, did you know that there are collaborative projects to ensure linguistic diversity?

In this post we tell you about the approach and the greatest advances of some initiatives that are building the digital foundations necessary for the co-official languages in Spain to also thrive in the era of artificial intelligence.

ILENIA, the coordinator of multilingual resource initiatives in Spain

The models that we are going to see in this post share a focus because they are part of ILENIA, a state-level coordinator that connects the individual efforts of the autonomous communities. This initiative brings together the projects BSC-CNS (AINA), CENID (VIVES), HiTZ (NEL-GAITU) and the University of Santiago de Compostela (NÓS), with the aim of generating digital resources that allow the development of multilingual applications in the different languages of Spain.

The success of these initiatives depends fundamentally on citizen participation. Through platforms such as Mozilla's Common Voice, any speaker can contribute to the construction of these linguistic resources through different forms of collaboration:

  • Read speech: collecting different ways of speaking through voice donations of a given text.
  • Spontaneous speech: creating real and organic datasets from conversations based on prompts.
  • Text in the language: collaborating in the transcription of audios or contributing textual content, suggesting new phrases or questions to enrich the corpora.

All resources are published under free licenses such as CC0, allowing them to be used free of charge by researchers, developers and companies.
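
As a small illustration of how these freely licensed resources can be explored once downloaded, here is a minimal sketch assuming a Common Voice-style validated.tsv file; the column names follow recent Common Voice releases and should be checked against the actual download.

```python
# Minimal sketch: a first look at a Common Voice-style release with pandas.
# Assumption: "validated.tsv" includes columns such as sentence and accents
# (column names vary between Common Voice versions; check the downloaded file).
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")

print("Validated clips:", len(df))
print("Distinct sentences:", df["sentence"].nunique())

# Rough view of how well different accents are represented, if the column exists.
if "accents" in df.columns:
    print(df["accents"].fillna("unspecified").value_counts().head(10))
```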

The challenge of linguistic diversity in the digital age

Artificial intelligence systems learn from the data they receive during their training. To develop technologies that work correctly in a specific language, it is essential to have large volumes of data: audio recordings, text corpora and examples of real use of the language.

In other publications of datos.gob.es we have addressed the functioning of foundational models and initiatives in Spanish such as ALIA, trained with large text corpora such as those of the Royal Spanish Academy.

Both posts explain why language data collection is not a cheap or easy task. Technology companies have invested massively in compiling these resources for languages with large numbers of speakers, but Spanish co-official languages face a structural disadvantage. This has led to many models not working properly or not being available in Valencian, Catalan, Basque or Galician.

However, there are collaborative and open data initiatives that allow the creation of quality language resources. These are the projects that several autonomous communities have launched, marking the way towards a multilingual digital future.

On the one hand, the Nós Project in Galicia creates oral and conversational resources in Galician, with all its accents and dialectal variants, to facilitate its integration into tools such as GPS, voice assistants or ChatGPT. A similar purpose drives Aina in Catalonia, which also offers an academic platform and a laboratory for developers, and Vives in the Valencian Community. In the Basque Country there is also the Euskorpus project, which aims to build a quality text corpus in Basque. Let's look at each of them.

Proyecto Nós, a collaborative approach to digital Galician

The project has already developed three operational tools: a multilingual neural translator, a speech recognition system that converts speech into text, and a speech synthesis application. These resources are published under open licenses, guaranteeing their free and open access for researchers, developers and companies. These are its main features:

  • Promoted by: the Xunta de Galicia and the University of Santiago de Compostela.
  • Main objective: to create oral and conversational resources in Galician that capture the dialectal and accent diversity of the language.
  • How to participate: The project accepts voluntary contributions both by reading texts and by answering spontaneous questions.

Aina, towards an AI that understands and speaks Catalan

With a similar approach to the Nós project, Aina seeks to facilitate the integration of Catalan into artificial intelligence language models.

It is structured in two complementary aspects that maximize its impact:

  • Aina Tech focuses on facilitating technology transfer to the business sector, providing the necessary tools to automatically translate websites, services and online businesses into Catalan.
  • Aina Lab promotes the creation of a community of developers through initiatives such as Aina Challenge, promoting collaborative innovation in Catalan language technologies. Through this call, 22 proposals have already been selected with a total amount of 1 million to execute their projects.


Vives, the collaborative project for AI in Valencian

On the other hand, Vives collects voices speaking in Valencian to serve as training data for AI models.

Gaitu: strategic investment in the digitalisation of the Basque language

In Basque, we can highlight Gaitu, which collects voices speaking in Basque in order to train AI models.

Benefits of Building and Preserving Multilingual Language Models

The digitization projects of the co-official languages transcend the purely technological field to become tools for digital equity and cultural preservation. Its impact is manifested in multiple dimensions:

  • For citizens: these resources ensure that speakers of all ages and levels of digital competence can interact with technology in their mother tongue, removing barriers that could exclude certain groups from the digital ecosystem.
  • For the business sector: the availability of open language resources makes it easier for companies and developers to create products and services in these languages without assuming the high costs traditionally associated with the development of language technologies.
  • For the research community: these corpora constitute a fundamental basis for advancing research in natural language processing and speech technologies, which is especially relevant for languages with less presence in international digital resources.

The success of these initiatives shows that it is possible to build a digital future where linguistic diversity is not an obstacle but a strength, and where technological innovation is put at the service of the preservation and promotion of linguistic cultural heritage.
