On the occasion of Open Data Day 2026, the Open Knowledge Foundation (OKFN) held an online conference entitled "The Future of Open Data", an open-access event that brought together a diverse community of data professionals from governments, civil society organizations, universities, newsrooms and activist collectives. From datos.gob.es we follow the day live and share here a summary of the main ideas that marked the day.
Three approaches to understanding the role of open data in the age of AI
The conference was structured around three main thematic blocks:
- Navigating open data regulation in the public interest: interventions by representatives of academia, public policy makers and researchers from different countries who discussed the regulatory framework of open data in the current context of AI.
- Community Voices, Open Data, and AI: Short presentations of concrete projects from around the world exploring the intersection between open data and artificial intelligence, from tools for judicial analysis to citizen science dashboards.
- 20 years of CKAN: The future in the age of AI: reflections on the two decades of history of open data and CKAN, on the past, present and challenges to come.
Overall, the day combined political reflection, technical innovation and community vision, with voices from Spain, France, India, Ukraine, Kenya, the United States and Australia, among other countries. And the common thread of the event was the question that today runs through digital policy forums around the world: what is the role of open data in an ecosystem increasingly dominated by artificial intelligence?
Thematic block 1. A movement that was born out of activism
In its origins, the open data movement began in conversations between activists committed to transparency, accountability and access to public information to citizens.
This episode of the datos.gob.es podcast also discusses the origin of open data and its evolution
Today, however, the movement is more diversified because there are now more agents that influence, such as artificial intelligence. There is also a regulatory context that functions as a framework in the development of the open data movement.
The topic of regulation and governance was the backbone of the first session of the event, moderated by Renata Ávila, CEO of OKFN. The following participated in it:
- Jonathan Gray, author of the book Public Data Cultures (Polity, 2025) and professor at King's College London, presented his work as a reference source for reflecting on data as an open asset: how this openness is built and how it can help us respond to great collective challenges. His proposal is that public data is not simply technical information, but the result of cultural and political decisions about what we tell, how we tell it, and for whom.
- Renato Berrino Malaccorto, research manager of the Open Data Charter, stressed that the openness of data is fundamental for the ethical development of AI. Without open, auditable and quality data, it is not possible to build artificial intelligence systems that are accountable to citizens. At the same time, he pointed out that there is a real capacity gap: many organizations and governments lack the technical and human resources necessary to harness the potential of open data in this new context.
- Ruth del Campo, general director of data at the Ministry for Digital Transformation and Public Function of the Government of Spain, offered a very relevant institutional perspective for our context. He recalled that "The data economy is part of the economy", and underlined the boost that the Government is giving to initiatives such as datos.gob.es and Impulsa Data (aimed at modernizing internal management and feeding the Sectoral Data Spaces). He also stressed the importance of the data strategy incorporating AI ready principles, guaranteeing adequate resources – such as linguistic corpora – to train AI models efficiently and without generating new inequalities. Finally, he pointed out the need to simplify and harmonize data regulations, a process in which progress is already being made at the European level.
The panel's underlying message was clear: open data needs to be placed at the heart of the digital agenda, adequately resourced and explicitly connected to public AI strategies. AI of social interest cannot be built without open data; and open data without a vision of AI risks being relegated to irrelevance.
Thematic block 2. Lightning Talks: Projects That Demonstrate the Potential of Open Data
The second session of the day brought together short presentations of concrete projects that illustrated how open data and artificial intelligence can work together in the public interest. Some examples are:
- Ihor Samokhodskyi from the Ukrainian initiative Policy Genome presented an open data-based analysis tool for judicial practice that demonstrates how public information, combined with AI techniques, can contribute to transparency and the improvement of justice systems.
- Javier Conde, from the Polytechnic University of Madrid, presented the proposal he has developed together with his colleagues Andrés Muñoz-Arcentales and Álvaro Alonso to improve the integration of European open data in data spaces. This project facilitates the automatic generation of high-quality metadata, thus ensuring the interoperability and reuse of datasets. A directly relevant initiative for the improvement of portals such as datos.gob.es and its connection with data.europa.eu.
- Renu Kumari, from #semanticClimate and Frictionless Data (India), presented a project that works at the intersection between open climate data and semantic tools to make scientific literature and data on climate change more accessible, structured and reusable.
- Richard Muraya, from The Demography Project (Kenya), presented Uhai/Life, a citizen science dashboard that aggregates open data on natural resource use to provide insight into human and environmental well-being at the local scale. An example of how open data can empower communities to tell their own story, without relying on external narratives or institutions.
Figure 1. Presentation slide of one of the presentations of the event. Source: conference "The Future of Open Data" organized by OKFN.
- Finally, Sayantika Banik from DataJourney (India) showed an autonomous analytics assistant capable of transforming open datasets into easily understandable information.
Thematic block 3. Round table: 20 years of CKAN and the challenges of the future
The longest session of the day was also the most reflective: a round table to celebrate two decades of CKAN, the open data portal management tool born within OKFN and which today feeds hundreds of data portals around the world, including datos.gob.es. The panel was moderated by Jamaica Jones, CKAN/POSE community manager at the University of Pittsburgh. The following participated in this table:
- Rufus Pollock, founder of OKFN and Datopian, and co-founder of Life Itself, stressed the importance of keeping power in the hands of citizens and of betting on open source as a driver of economic development and shared knowledge. For Pollock, AI must be understandable and accessible to most, not just large corporations.
- Joel Natividad is Co-CEO and co-founder of datHere, a company specializing in open data solutions and analytics tools for the public sector. As a CKAN user for more than 15 years, he insisted on one idea: "We have always tried to learn how machines think, and now it is machines that are learning how humans think."
- Patricio Del Boca is Tech Lead and Open Activist at OKFN, where he leads the technical development of initiatives related to CKAN and open data infrastructures. He shared OKFN's next steps for 2026: building more community and developing use cases that demonstrate the practical value of open data in the current context.
- Andrea Borruso is an expert in Geographic Information Systems (GIS) and open data. As president of onData, an Italian non-profit association that promotes access to and reuse of public data, he highlighted data activism and citizen science as drivers of technological development that involve the community.
- Antonin Garrone of data.gouv.fr, France's national open data portal, brought to the table the perspective of an established portal that has spent years exploring how to integrate new technologies without losing sight of its public service mission.
- Steven De Costa is CEO of Link Digital, an Australian company specializing in the implementation and development of CKAN-based solutions, and Co-Steward of the CKAN project. His perspective combined technical vision with a concern to maintain an open and participatory governance model.
- Finally, Public AI research engineer Mohsin Yousufi insisted on the intersection between artificial intelligence, public data infrastructures, and technology policies, exploring how AI systems can be designed and governed to serve the public interest.
Final Thought: Open Data as Democratic Infrastructure
If there is one conclusion that ran through all the sessions of Open Data Day 2026, it is that open data is not in crisis, but at a decisive moment. The opportunities offered by artificial intelligence are real, but so are the risks. It is important to know them in order to know how to address them. Some of those that were mentioned are:
- Prevent public data from becoming the raw material of private systems without transparency or accountability.
- Preserve the political will to keep open data portals functional and updated.
- Bridging the digital skills and training gap to facilitate the participation of all countries and communities in the new AI ecosystem.
In the face of this, the message of the event was one of mobilization: it is necessary to vindicate open data as a democratic infrastructure, explicitly connect data policies with public AI strategies, and ensure that the benefits of artificial intelligence reach all citizens, and not only those who already have access to technological resources.
From datos.gob.es we will continue to work in that direction, and we celebrate the existence of spaces such as Open Data Day to remind us why we started and where we want to go.
You can watch the event video again here
"I'm going to upload a CSV file for you. I want you to analyze it and summarize the most relevant conclusions you can draw from the data". A few years ago, data analysis was the territory of those who knew how to write code and use complex technical environments, and such a request would have required programming or advanced Excel skills. Today, being able to analyse data files in a short time with AI tools gives us great professional autonomy. Asking questions, contrasting preliminary ideas and exploring information first-hand changes our relationship with knowledge, especially because we stop depending on intermediaries to obtain answers. Gaining the ability to analyze data with AI independently speeds up processes, but it can also cause us to become overconfident in conclusions.
Based on the example of a raw data file, we are going to review possibilities, precautions and basic guidelines to explore the information without assuming conclusions too quickly.
The file:
To show an example of data analysis with AI we will use a file from the National Institute of Statistics (INE) that collects information on tourist flows in Europe, specifically on occupancy in rural tourism accommodation. The data file contains information from January 2001 to December 2025. It contains disaggregations by sex, age and autonomous community or city, which allows comparative analyses to be carried out over time. At the time of writing, the last update to this dataset was on January 28, 2026.

Figure 1. Dataset information. Source: National Institute of Statistics (INE).
1. Initial exploration
For this first exploration we are going to use a free version of Claude, the AI-based multitasking chat developed by Anthropic. It is one of the most advanced language models in reasoning and analysis benchmarks, which makes it especially suitable for this exercise, and it is the most widely used option currently by the community to perform tasks that require code.
Let's think that we are facing the data file for the first time. We know in broad strokes what it contains, but we do not know the structure of the information. Our first prompt, therefore, should focus on describing it:
PROMPT: I want to work with a data file on occupancy in rural tourism accommodation. Explain to me what structure the file has: what variables it contains, what each one measures and what possible relationships exist between them. It also points out possible missing values or elements that require clarification.

Figure 2. Initial exploration of the data file with Claude. Source: Claude.
Once Claude has given us the general idea and explanation of the variables, it is good practice to open the file and do a quick check. The objective is to assess that, at a minimum, the number of rows, the number of columns, the names of the variables, the time period and the type of data coincide with what the model has told us.
If we detect any errors at this point, the LLM may not be reading the data correctly. If after trying in another conversation the error persists, it is a sign that there is something in the file that makes it difficult to read automatically. In this case, it is best not to continue with the analysis, as the conclusions will be very apparent, but will be based on misinterpreted data.
2. Anomaly management
Second, if we have discovered anomalies, it is common to document them and decide how to handle them before proceeding with the analysis. We can ask the model to suggest what to do, but the final decisions will be ours. For example:
- Missing values: if there are empty cells, we need to decide whether to fill them with an "average" value from the column or simply delete those rows.
- Duplicates: we have to eliminate repeated rows or rows that do not provide new information.
- Formatting errors or inconsistencies: we must correct these so that the variables are coherent and comparable. For example, dates represented in different formats.
- Outliers: if a number appears that does not make sense or is exaggeratedly different from the rest, we have to decide whether to correct it, ignore it or treat it as it is.

Figure 3. Example of missing values analysis with Claude. Source: Claude.
In the case of our file, for example, we have detected that in Ceuta and Melilla the missing values in the Total variable are structural, there is no rural tourism registered in these cities, so we could exclude them from the analysis.
Before making the decision, a good practice at this point is to ask the LLM for the pros and cons of modifying the data. The answer can give us some clue as to which is the best option, or indicate some inconvenience that we had not taken into account.

Figure 4. Claude's analysis on the possibility of eliminating or not securities. Source: Claude.
If we decide to go ahead and exclude the cities of Ceuta and Melilla from the analysis, Claude can help us make this modification directly on the file. The prompt would be as follows:
PROMPT: Removes all rows corresponding to Ceuta and Melilla from the file, so that the rest of the data remains intact. Also explain the steps you're following so they can review them.

Figura 5. Step by step in the modification of data in Claude. Source: Claude.
At this point, Claude offers to download the modified file again, so a good checking practice would be to manually validate that the operation was done correctly. For example, check the number of rows in one file and another or check some rows at random with the first file to make sure that the data has not been corrupted.
3. First questions and visualizations
If the result so far is satisfactory, we can already start exploring the data to ask ourselves initial questions and look for interesting patterns. The ideal when starting the exploration is to ask big, clear and easy to answer questions with the data, because they give us a first vision.
PROMPT: It works with the file without Ceuta and Melilla from now on. Which have been the five communities with the most rural tourism in the total period?

Figure 6. Claude's response to the five communities with the most rural tourism in the period. Source: Claude.
Finally, we can ask Claude to help us visualize the data. Instead of making the effort to point you to a particular chart type, we give you the freedom to choose the format that best displays the information.
PROMPT: Can you visualize this information on a graph? Choose the most appropriate format to represent the data.

Figure 7. Graph prepared by Cloude to represent the information. Source: Claude.
Here, the screen unfolds: on the left, we can continue with the conversation or download the file, while on the right we can view the graph directly. Claude has generated a very visual and ready-to-use horizontal bar chart. The colors differentiate the communities and the date range and type of data are correctly indicated.
What happens if we ask you to change the color palette of the chart to an inappropriate one? In this case, for example, we are going to ask you for a series of pastel shades that are hardly different.
PROMPT: Can you change the color palette of the chart to this? #E8D1C5, #EDDCD2, #FFF1E6, #F0EFEB, #EEDDD3

.Figure 8. Adjustments made to the graph by Claude to represent the information. Source: Claude.
Faced with the challenge, Claude intelligently adjusts the graphic himself, darkens the background and changes the text on the labels to maintain readability and contrast
All of the above exercise has been done with Claude Sonnet 4.6, which is not Anthropic's highest quality model. Its higher versions, such as Claude Opus 4.6, have greater reasoning capacity, deep understanding and finer results. In addition, there are many other tools for working with AI-based data and visualizations, such as Julius or Quadratic. Although the possibilities are almost endless in them, when we work with data it is still essential to maintain our own methodology and criteria.
Contextualizing the data we are analyzing in real life and connecting it with other knowledge is not a task that can be delegated; We need to have a minimum prior idea of what we want to achieve with the analysis in order to transmit it to the system. This will allow us to ask better questions, properly interpret the results and therefore make a more effective prompting.
Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.
Every year, the international open knowledge advocacy organization Open Knowledge Foundation (OKFN) organizes Open Data Day (ODD), a framework initiative that will bring together activities around the world to demonstrate the value of open data. It is a meeting point for public administrations, civil society, universities, technology companies and citizens interested in the reuse of public information. It is, above all, an invitation to move from theory to practice: to open data, reuse it and turn it into concrete solutions.
From datos.gob.es, national open data portal, we join this celebration by also compiling other activities that put data and related technologies at the center. In this post we review some events that will be held during this month of March. Take note and write down the agenda!
Data against misinformation: celebrate Open Data Day with Open Data Barcelona Initiative
This meeting is part of the activities organized in Spain on the occasion of Open Data Day 2026, and is focused on the role of open data as a tool to strengthen the quality of public information and combat disinformation. The event will give visibility to projects that use open data to promote a more transparent democracy, encourage informed citizen participation and contribute to the development of responsible artificial intelligence based on reliable data.
- When? On Tuesday, March 10 at 5:30 p.m.
- Where? Ca l'Alier C/ de Pere IV, 362 in Barcelona
- Learn more
The future of Open Data: OKFN's anniversary
On the occasion of Open Data Day 2026, the Open Knowledge Foundation (OKFN) is organizing an online conference to bring together the open data community and celebrate two decades of CKAN, the tool that emerged from OKFN's work that today powers data portals around the world. The meeting will provide an opportunity to discuss the current role of open data and data infrastructures in the face of contemporary technical and political challenges. It is aimed at professionals from governments, civil society, the media, activist groups and all those interested in reflecting on the future of open data in a rapidly changing technological context, marked especially by the emergence of artificial intelligence tools.
- When? On Wednesday, March 11 from 11 a.m. to 4 p.m.
- Where? Online
- Learn more
Data as a public good: European webinar
Organized by the data.europa.eu academy in the framework of Open Data Day, this webinar addresses how open data can act as a public good to improve decision-making in all territories, especially in rural areas. Through case studies from the United Kingdom and Ireland, the session will show how open information can identify local needs, reduce territorial inequalities and design evidence-based public policies that ensure more equitable access to essential services.
- When? Friday, March 13 from 10 a.m. to 11.30 a.m.
- Where? Online event
- Learn more
Solid World: innovation in the sharing and reuse of scientific data
This event will explore how to model, analyze, and share research data using technologies from the Solid* ecosystem. The session will feature representatives from W3C and Open Data Institute to present the SpOTy project, a web application for organizing and analyzing linguistic data that has migrated from RDF to Solid to give researchers greater control over the sharing of their data, also addressing challenges of interoperability and responsible reuse of scientific information.
*The Solid Ecosystem is a set of technologies, standards, and tools that enable individuals and organizations to control their own data on the web and decide how, when, and with whom it is shared.
- When? Monday, March 23 from 5 p.m. to 6 p.m.
- Where? Online event
- Learn more
How to prepare public portals for the AI era
The thirteenth edition of the Data Centric AI cycle, organized by the Open Data Institute (ODI), will explore how public data portals must evolve to adapt to new ways of interacting with datasets. It will address the transformation of infrastructures such as data.gov.uk, plans for the National Data Library and the role of academic research in the design of new public data architectures, combining preparation for artificial intelligence with a user-centric approach and reflecting on the social context surrounding data and AI.
- When? Thursday, March 26 from 5 pm to 6 pm
- Where? Online event
- Learn more
Online events on open data in different sectors with Open Data Week
Open Data Week is an annual festival of events held every March in New York City and organized by the NYC Open Data team in conjunction with BetaNYC and Data Through Design. The week commemorates the anniversary of the city's first open data law, signed on March 7, 2012, and also coincides with Open Data Day, reinforcing its connection with the international open data movement. Some of the scheduled activities will be in virtual format.
- When? From 22 to 29 March
- Where? Some events can be followed in streaming
- Learn more
Data ethics keys for organizations
This session of the Data Ethics Professionals cycle organized by ODI will focus on the main lessons learned by organizations that have initiated processes of integrating data ethics into their structures and workflows. The seminar will address common challenges such as obtaining management support, the practical incorporation of ethical tools and frameworks, and the management of workloads in organizational transformation processes.
- When? On Monday, March 30 from 2 p.m. to 3 p.m.
- Where? Online
- Learn more
In short, the calendar for the coming weeks offers multiple opportunities to delve into the strategic value of open data and associated technologies. From local initiatives against disinformation to sectoral data spaces and European seminars on data as a public good, the ecosystem continues to grow and diversify. We encourage you to participate, share these calls and transfer the learnings to your organization. Because Open Data Day is just the starting point: true transformation is built throughout the year, connecting community, knowledge and action through open data.
These are some of the events that are scheduled for this month of March. In any case, don't forget to follow us on social networks so you don't miss any news about innovation and open data. We are on X and LinkedIn you can write to us if you need extra information.
La The European Commission has recently presented the document setting out a new EU Strategy in the field of data. Among other ambitious objectives, this initiative aims to address a transcendental challenge in the era of generative artificial intelligence: the insufficient availability of data under the right conditions.
Since the previous 2020 Strategy, we have witnessed an important regulatory advance that aimed to go beyond the 2019 regulation on open data and reuse of public sector information.
Specifically, on the one hand, the Data Governance Act served to promote a series of measures that tended to facilitate the use of data generated by the public sector in those cases where other legal rights and interests were affected – personal data, intellectual property.
On the other hand, through the Data Act, progress was made, above all, in the line of promoting access to data held by private subjects, taking into account the singularities of the digital environment.
The necessary change of focus in the regulation on access to data.
Despite this significant regulatory effort, the European Commission has detected an underuse of data , which is also often fragmented in terms of the conditions of its accessibility. This is due, in large part, to the existence of significant regulatory diversity. Measures are therefore needed to facilitate the simplification and streamlining of the European regulatory framework on data.
Specifically, it has been found that there is regulatory fragmentation that generates legal uncertainty and disproportionate compliance costs due to the complexity of the applicable regulatory framework itself. Specifically, the overlap between the General Data Protection Regulation (GDPR), the Data Governance Act, the Data Act, the Open Data Directive and, likewise, the existence of sectoral regulations specific to some specific areas has generated a complex regulatory framework which is difficult to face, especially if we think about the competitiveness of small and medium-sized companies. Each of these standards was designed to address specific challenges that were addressed successively, so a more coherent overview is needed to resolve potential inconsistencies and ultimately facilitate their practical implementation.
In this regard, the Strategy proposes to promote a new legislative instrument – the proposal for a Regulation called Digital Omnibus – which aims to consolidate the rules relating to the European single market in the field of data into a single standard. Specifically, with this initiative:
- The provisions of the Data Governance Act are merged into the regulation of the Data Act, thus eliminating duplications.
- The Regulation on non-personal data, whose functions are also covered by the Data Act, is repealed;
- Public sector data standards are integrated into the Data Act, as they were previously included in both the 2019 Directive and the Data Governance Act.
This regulation therefore consolidates the role of the Data Act as a general reference standard in the field. It also strengthens the clarity and precision of its forecasts, with the aim of facilitating its role as the main regulatory instrument through which it is intended to promote the accessibility of data in the European digital market.
Modifications in terms of personal data protection
The Digital Omnibus proposal also includes important new features with regard to the regulations on the protection of personal data, amending several provisions of Regulation (EU) 1016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data.
In order for personal data to be used – that is, any information referring to an identified or identifiable natural person – it is necessary that one of the circumstances referred to in Article 6 of the aforementioned Regulation is present, including the consent of the owner or the existence of a legitimate interest on the part of the person who is going to process the data.
Legitimate interest allows personal data to be processed when it is necessary for a valid purpose (improving a service, preventing fraud, etc.) and does not adversely affect the rights of the individual.
Source: Guide on legitimate interest. ISMS Forum and Data Privacy Institute. Available here: guiaintereslegitimo1637794373.pdf
Regarding the possibility of resorting to legitimate interest as a legal basis for training artificial intelligence tools, the current regulation allows the processing of personal data as long as the rights of the interested parties who own such data do not prevail.
However, given the generality of the concept of "legitimate interest", when deciding when personal data may be used under this clause , there will not always be absolute certainty, it will be necessary to analyse on a case-by-case basis: specifically, it will be necessary to carry out an activity of weighing the conflicting legal interests and, therefore, its application may give rise to reasonable doubts in many cases.
Although the European Data Protection Board has tried to establish some guidelines to specify the application of legitimate interest, the truth is that the use of open and indeterminate legal concepts will not always allow clear and definitive answers to be reached. To facilitate the specification of this expression in each case, the Strategy refers as a criterion to takeinto account the potential benefit that the processing may entail for the data subject and for society in general. Likewise, given that the consent of the owner of the data will not be necessary – and therefore, its revocation would not be applicable – it reinforces the right of opposition by the owner to the processing of their data and, above all, guarantees greater transparency regarding the conditions under which the data will be processed. Thus, by strengthening the legal position of the data subject and referring to this potential benefit, the Strategy aims to facilitate the use of legitimate interest as a legal basis for the use of personal data without the consent of the data subject, but with appropriate safeguards.
Another major data protection measure concerns the distinction between anonymised and pseudonymised data. The GDPR defines pseudonymisation as data processing that, until now, could no longer be attributed to a data subject without recourse to additional, separate information. However, pseudonymised data is still personal data and, therefore, subject to this regulation. On the other hand, anonymous data does not relate to identified or identifiable persons and therefore its use would not be subject to the GDPR. Consequently, in order to know whether we are talking about anonymous or pseudo-nimized data, it is essential to specify whether there is a "reasonable probability" of identifying the owner of the data.
However, the technologies currently available multiply the risk of re-identification of the data subject, which directly affects what could be considered reasonable, generating uncertainty that has a negative impact on technological innovation. For this reason, the Digital Omnibus proposal, along the lines already stated by the Court of Justice of the European Union, aims to establish the conditions under which pseudonymised data could no longer be considered personal data, thus facilitating its use. To this end, it empowers the European Commission, through implementing acts, to specify such circumstances, in particular taking into account the state of the art and, likewise, offering criteria that allow the risk of re-identification to be assessed in each specific case.
Scaling High-Value Datasets
The Strategy also aims to expand the catalogue of High Value Data (HVD) provided for in Implementing Regulation (EU) 2023/138. These are datasets with exceptional potential to generate social, economic and environmental benefits, as they are high-quality, structured and reliable data that are accessible under technical, organisational and semantic conditions that are very favourable for automated processing. Six categories are currently included (geospatial, Earth observation and environment, meteorology, statistics, business and mobility), to which the Commission would add, among other things, legal, judicial and administrative data.
Opportunity and challenge
The European Data Strategy represents a paradigmatic shift that is certainly relevant: it is not only a matter of promoting regulatory frameworks that facilitate the accessibility of data at a theoretical level but, above all, of making them work in their practical application, thus promoting the necessary conditions of legal certainty that allow a competitive and innovative data economy to be energized.
To this end, it is essential, on the one hand, to assess the real impact of the measures proposed through the Digital Omnibus and, on the other, to offer small and medium-sized enterprises appropriate legal instruments – practical guides, suitable advisory services, standard contractual clauses, etc. – to face the challenge that regulatory compliance poses for them in a context of enormous complexity. Precisely, this difficulty requires, on the part of the supervisory authorities and, in general, of public entities, to adopt advanced and flexible data governance models that adapt to the singularities posed by artificial intelligence, without affecting legal guarantees.
Content prepared by Julián Valero, professor at the University of Murcia and coordinator of the Innovation, Law, and Technology Research Group (iDerTec). The content and views expressed in this publication are the sole responsibility of the author.
Did you know that Spain created the first state agency specifically dedicated to the supervision of artificial intelligence (AI) in 2023? Even anticipating the European Regulation in this area, the Spanish Agency for the Supervision of Artificial Intelligence (AESIA) was born with the aim of guaranteeing the ethical and safe use of AI, promoting responsible technological development.
Among its main functions is to ensure that both public and private entities comply with current regulations. To this end, it promotes good practices and advises on compliance with the European regulatory framework, which is why it has recently published a series of guides to ensure the consistent application of the European AI regulation.
In this post we will delve into what the AESIA is and we will learn relevant details of the content of the guides.
What is AESIA and why is it key to the data ecosystem?
The AESIA was created within the framework of Axis 3 of the Spanish AI Strategy. Its creation responds to the need to have an independent authority that not only supervises, but also guides the deployment of algorithmic systems in our society.
Unlike other purely sanctioning bodies, the AESIA is designed as an intelligence Think & Do, i.e. an organisation that investigates and proposes solutions. Its practical usefulness is divided into three aspects:
- Legal certainty: Provides clear frameworks for businesses, especially SMEs, to know where to go when innovating.
- International benchmark: it acts as the Spanish interlocutor before the European Commission, ensuring that the voice of our technological ecosystem is heard in the development of European standards.
- Citizen trust: ensures that AI systems used in public services or critical areas respect fundamental rights, avoiding bias and promoting transparency.
Since datos.gob.es, we have always defended that the value of data lies in its quality and accessibility. The AESIA complements this vision by ensuring that, once data is transformed into AI models, its use is responsible. As such, these guides are a natural extension of our regular resources on data governance and openness.
Resources for the use of AI: guides and checklists
The AESIA has recently published materials to support the implementation and compliance with the European Artificial Intelligence regulations and their applicable obligations. Although they are not binding and do not replace or develop existing regulations, they provide practical recommendations aligned with regulatory requirements pending the adoption of harmonised implementing rules for all Member States.
They are the direct result of the Spanish AI Regulatory Sandbox pilot. This sandbox allowed developers and authorities to collaborate in a controlled space to understand how to apply European regulations in real-world use cases.
It is essential to note that these documents are published without prejudice to the technical guides that the European Commission is preparing. Indeed, Spain is serving as a "laboratory" for Europe: the lessons learned here will provide a solid basis for the Commission's working group, ensuring consistent application of the regulation in all Member States.
The guides are designed to be a complete roadmap, from the conception of the system to its monitoring once it is on the market.

Figure 1. AESIA guidelines for regulatory compliance. Source: Spanish Agency for the Supervision of Artificial Intelligence
- 01. Introductory to the AI Regulation: provides an overview of obligations, implementation deadlines and roles (suppliers, deployers, etc.). It is the essential starting point for any organization that develops or deploys AI systems.
- 02. Practice and examples: land legal concepts in everyday use cases (e.g., is my personnel selection system a high-risk AI?). It includes decision trees and a glossary of key terms from Article 3 of the Regulation, helping to determine whether a specific system is regulated, what level of risk it has, and what obligations are applicable.
- 03. Conformity assessment: explains the technical steps necessary to obtain the "seal" that allows a high-risk AI system to be marketed, detailing the two possible procedures according to Annexes VI and VII of the Regulation as valuation based on internal control or evaluation with the intervention of a notified body.
- 04. Quality management system: defines how organizations must structure their internal processes to maintain constant standards. It covers the regulatory compliance strategy, design techniques and procedures, examination and validation systems, among others.
- 05. Risk management: it is a manual on how to identify, evaluate and mitigate possible negative impacts of the system throughout its life cycle.
- 06. Human surveillance: details the mechanisms so that AI decisions are always monitorable by people, avoiding the technological "black box". It establishes principles such as understanding capabilities and limitations, interpretation of results, authority not to use the system or override decisions.
- 07. Data and data governance: addresses the practices needed to train, validate, and test AI models ensuring that datasets are relevant, representative, accurate, and complete. It covers data management processes (design, collection, analysis, labeling, storage, etc.), bias detection and mitigation, compliance with the General Data Protection Regulation, data lineage, and design hypothesis documentation, being of particular interest to the open data community and data scientists.
- 08. Transparency: establishes how to inform the user that they are interacting with an AI and how to explain the reasoning behind an algorithmic result.
- 09. Accuracy: Define appropriate metrics based on the type of system to ensure that the AI model meets its goal.
- 10. Robustness: Provides technical guidance on how to ensure AI systems operate reliably and consistently under varying conditions.
- 11. Cybersecurity: instructs on protection against threats specific to the field of AI.
- 12. Logs: defines the measures to comply with the obligations of automatic registration of events.
- 13. Post-market surveillance: documents the processes for executing the monitoring plan, documentation and analysis of data on the performance of the system throughout its useful life.
- 14. Incident management: describes the procedure for reporting serious incidents to the competent authorities.
- 15. Technical documentation: establishes the complete structure that the technical documentation must include (development process, training/validation/test data, applied risk management, performance and metrics, human supervision, etc.).
- 16. Requirements Guides Checklist Manual: explains how to use the 13 self-diagnosis checklists that allow compliance assessment, identifying gaps, designing adaptation plans and prioritizing improvement actions.
All guides are available here and have a modular structure that accommodates different levels of knowledge and business needs.
The self-diagnostic tool and its advantages
In parallel, the AESIA publishes material that facilitates the translation of abstract requirements into concrete and verifiable questions, providing a practical tool for the continuous assessment of the degree of compliance.
These are checklists that allow an entity to assess its level of compliance autonomously.
The use of these checklists provides multiple benefits to organizations. First, they facilitate the early identification of compliance gaps, allowing organizations to take corrective action prior to the commercialization or commissioning of the system. They also promote a systematic and structured approach to regulatory compliance. By following the structure of the rules of procedure, they ensure that no essential requirement is left unassessed.
On the other hand, they facilitate communication between technical, legal and management teams, providing a common language and a shared reference to discuss regulatory compliance. And finally, checklists serve as a documentary basis for demonstrating due diligence to supervisory authorities.
We must understand that these documents are not static. They are subject to an ongoing process of evaluation and review. In this regard, the EASIA continues to develop its operational capacity and expand its compliance support tools.
From the open data platform of the Government of Spain, we invite you to explore these resources. AI development must go hand in hand with well-governed data and ethical oversight.
For more than a decade, open data platforms have measured their impact through relatively stable indicators: number of downloads, web visits, documented reuses, applications or services created based on them, etc. These indicators worked well in an ecosystem where users – companies, journalists, developers, anonymous citizens, etc. – directly accessed the original sources to query, download and process the data.
However, the panorama has changed radically. The emergence of generative artificial intelligence models has transformed the way people access information. These systems generate responses without the need for the user to visit the original source, which is causing a global drop in web traffic in media, blogs and knowledge portals.
In this new context, measuring the impact of an open data platform requires rethinking traditional indicators to incorporate new ones to the metrics already used that also capture the visibility and influence of data in an ecosystem where human interaction is changing.

Figure 1. Metrics for measuring the impact of open data in the age of AI.
A structural change: from click to indirect consultation
The web ecosystem is undergoing a profound transformation driven by the rise of large language models (LLMs). More and more people are asking their questions directly to systems such as ChatGPT, Copilot, Gemini or Perplexity, obtaining immediate and contextualized answers without the need to resort to a traditional search engine.
At the same time, those who continue to use search engines such as Google or Bing are also experiencing relevant changes derived from the integration of artificial intelligence on these platforms. Google, for example, has incorporated features such as AI Overviews, which offers automatically generated summaries at the top of the results, or AI Mode, a conversational interface that allows you to drill down into a query without browsing links. This generates a phenomenon known as Zero-Click: the user performs a search on an engine such as Google and gets the answer directly on the results page itself. As a result, you don't need to click on any external links, which limits visits to the original sources from which the information is extracted.
All this implies a key consequence: web traffic is no longer a reliable indicator of impact. A website can be extremely influential in generating knowledge without this translating into visits.
New metrics to measure impact
Faced with this situation, open data platforms need new metrics that capture their presence in this new ecosystem. Some of them are listed below.
-
Share of Model (SOM): Presence in AI models
Inspired by digital marketing metrics, the Share of Model measures how often AI models mention, cite, or use data from a particular source. In this way, the SOM helps to see which specific data sets (employment, climate, transport, budgets, etc.) are used by the models to answer real questions from users, revealing which data has the greatest impact.
This metric is especially valuable because it acts as an indicator of algorithmic trust: when a model mentions a web page, it is recognizing its reliability as a source. In addition, it helps to increase indirect visibility, since the name of the website appears in the response even when the user does not click.
-
Sentiment analysis: tone of mentions in AI
Sentiment analysis allows you to go a step beyond the Share of Model, as it not only identifies if an AI model mentions a brand or domain, but how it does so. Typically, this metric classifies the tone of the mention into three main categories: positive, neutral, and negative.
Applied to the field of open data, this analysis helps to understand the algorithmic perception of a platform or dataset. For example, it allows detecting whether a model uses a source as an example of good practice, if it mentions it neutrally as part of an informative response, or if it associates it with problems, errors, or outdated data.
This information can be useful to identify opportunities for improvement, strengthen digital reputation, or detect potential biases in AI models that affect the visibility of an open data platform.
-
Categorization of prompts: in which topics a brand stands out
Analyzing the questions that users ask allows you to identify what types of queries a brand appears most frequently in. This metric helps to understand in which thematic areas – such as economy, health, transport, education or climate – the models consider a source most relevant.
For open data platforms, this information reveals which datasets are being used to answer real user questions and in which domains there is greater visibility or growth potential. It also allows you to spot opportunities: if an open data initiative wants to position itself in new areas, it can assess what kind of content is missing or what datasets could be strengthened to increase its presence in those categories.
-
Traffic from AI: clicks from digests generated
Many models already include links to the original sources. While many users don't click on such links, some do. Therefore, platforms can start measuring:
- Visits from AI platforms (when these include links).
- Clicks from rich summaries in AI-integrated search engines.
This means a change in the distribution of traffic that reaches websites from the different channels. While organic traffic—traffic from traditional search engines—is declining, traffic referred from language models is starting to grow.
This traffic will be smaller in quantity than traditional traffic, but more qualified, since those who click from an AI usually have a clear intention to go deeper.
It is important that these aspects are taken into account when setting growth objectives on an open data platform.
-
Algorithmic Reuse: Using Data in Models and Applications
Open data powers AI models, predictive systems, and automated applications. Knowing which sources have been used for their training would also be a way to know their impact. However, few solutions directly provide this information. The European Union is working to promote transparency in this field, with measures such as the template for documenting training data for general-purpose models, but its implementation – and the existence of exceptions to its compliance – mean that knowledge is still limited.
Measuring the increase in access to data through APIs could give an idea of its use in applications to power intelligent systems. However, the greatest potential in this field lies in collaboration with companies, universities and developers immersed in these projects, so that they offer a more realistic view of the impact.
Conclusion: Measure what matters, not just what's easy to measure
A drop in web traffic doesn't mean a drop in impact. It means a change in the way information circulates. Open data platforms must evolve towards metrics that reflect algorithmic visibility, automated reuse, and integration into AI models.
This doesn't mean that traditional metrics should disappear. Knowing the accesses to the website, the most visited or the most downloaded datasets continues to be invaluable information to know the impact of the data provided through open platforms. And it is also essential to monitor the use of data when generating or enriching products and services, including artificial intelligence systems. In the age of AI, success is no longer measured only by how many users visit a platform, but also by how many intelligent systems depend on its information and the visibility that this provides.
Therefore, integrating these new metrics alongside traditional indicators through a web analytics and SEO strategy * allows for a more complete view of the real impact of open data. This way we will be able to know how our information circulates, how it is reused and what role it plays in the digital ecosystem that shapes society today.
*SEO (Search Engine Optimization) is the set of techniques and strategies aimed at improving the visibility of a website in search engines.
Public administrations and, specifically, local entities are at a crucial moment of digital transformation. The accelerated development of artificial intelligence (AI) poses both extraordinary opportunities and complex challenges that require structured, ethical, and informed adaptation. In this context, the Spanish Federation of Municipalities and Provinces (FEMP) has launched the Practical Guide and Policies for the Use of Artificial Intelligence in Local Entities, a reference document that aspires to function as a compass for city councils, provincial councils and other local entities on their way to the responsible adoption of this technology that is advancing by leaps and bounds.
The guide is based on a key idea: AI is not just a technological issue, but also an organisational, legal, ethical and cultural one. Its implementation requires planning, governance and a strategic vision adapted to the size and digital maturity of each local authority. In this post, we will look at some of the key points in the document.
In this video you can watch the presentation session of the Guide again.
The guide is based on a key idea: AI is not only a technological issue, but also an organizational, legal, ethical and cultural one. Its implementation requires planning, governance and a strategic vision adapted to the size and digital maturity of each local entity. In this post, we will look at some relevant keys to the document.
Why an AI guide for local authorities
Local administrations have been pursuing continuous improvement of public services for years, but have often been constrained by a lack of technological resources, organizational rigidity, or data fragmentation. AI opens up an unprecedented opportunity to overcome many of these barriers, because it makes it possible to:
- Automate processes.
- Analyze large volumes of information.
- Anticipate citizen needs.
- Personalize public attention.
However, along with these opportunities come obvious risks: loss of transparency, discriminatory biases, violations of privacy or uncritical automation of decisions that affect fundamental rights. Hence the need for a guide that helps to know what can be done with AI, what should not be done and how to do it with guarantees.
48% of public administrations use Artificial Intelligence to streamline the relationship with citizens
The guide is structured around several fundamental axes that address the multiple dimensions of AI implementation at the local level:
The legal framework: the European AI Regulation as a central axis
One of the pillars of the guide is the analysis of the European Artificial Intelligence Regulation (RIA), the first comprehensive standard worldwide that regulates AI with a risk-based approach, and which fully affects local entities whether they develop, use or contract AI systems.
Specifically, the different levels of risk recognized by the RIA are:
- Unacceptable risk: includes prohibited practices such as social punctuation, subliminal manipulation or certain uses of biometrics.
- High risk: covers systems used in sensitive areas such as the management of public services, employment, education, security or administrative decision-making. These systems must meet a number of strict requirements, such as the need for human supervision or data traceability.
- Specific transparency risk: This applies to chatbots or AI-generated content and is mainly imposed on them reporting obligations (e.g. labelling content as AI-generated).
- Minimal risk: such as spam filters or AI-based video games, with no obligations, although adopting codes of conduct is recommended.
Crucially, from February 2025, local authorities must make their staff AI literate, i.e. provide training for those who operate these systems to understand their technical, legal and ethical implications. In addition, they must review whether any of their systems incur in practices prohibited by the RIA, such as subliminal manipulation or social scoring .
For local authorities, this implies the need to identify which AI systems they use, assess their level of risk and comply with the corresponding obligations.
AI, administrative procedure and data protection
The guide recalls that automating a process does not necessarily equate to using AI, but that when systems capable of inferring, recommending or deciding are incorporated, the legal impact is much greater.
The incorporation of AI in administrative procedures must respect principles such as:
- Transparency and explainability of decisions.
- Identification of the responsible body.
- Possibility of challenge.
- Effective human supervision.
In addition, the use of AI must be fully compliant with the General Data Protection Regulation (GDPR). Local authorities should be able to justify automated decisions, guarantee the rights of affected individuals and exercise extreme caution when processing sensitive personal data.
Governing Data to Govern AI
The guide is blunt on one point: AI cannot be implemented without strong data governance. AI is powered by data, and its quality, availability, and ethical management will determine the success or failure of any initiative. The document introduces the concept of "single data" and refers to unified information. In relation to this, it has been seen that many organizations discover structural deficiencies precisely when trying to implement AI, an idea that was addressed in the datos.gob.es podcast on data and AI.
For data governance in AI in the local context, the guide defines the importance of:
- Privacy by design.
- Strategic value of data.
- Institutional ethical responsibility.
- Traceability.
- Shared knowledge and quality management.
In addition, the adoption of recognized international frameworks such as DAMA-DMBOOK is recommended.
The guide also insists on the importance of the quality, availability and correct management of data to "guarantee an effective and responsible use of artificial intelligence in our local administrations". To do this, it is essential to:
• Adopt interoperability standards, such as those already existing at national and European level.
• Use APIs and secure data exchange systems, which allow information to be shared efficiently between different public bodies.
• Leverage open data sources, such as those provided by the Spanish Government's Open Data Portal or local public data platforms.
How to know the maturity status in AI
Another of the most innovative aspects of the guide is the FEMP methodology. AI, which allows local entities to self-assess their level of organizational maturity for AI deployment. This methodology distinguishes three progressive levels that should be adopted in order:
- Level 1 - Electronic Administration (EA): digitalisation of business areas through corporate solutions based on single data.
- Level 2 - Robotization of Automated Processes (RPA): automation of management processes and administrative actions.
- Level 3 - Artificial Intelligence: use of specialized analysis tools that provide information obtained automatically.
This gradual approach is essential, as it recognizes that not all entities start from the same point and that trying to skip stages can be counterproductive.
The guide stresses that AI should be seen as a support tool, not as a substitute for human judgment, especially in sensitive decisions.
Requirements for deploying AI in a local entity
The document systematically details the requirements necessary to implement AI with guarantees:
- Normative and ethical, ensuring legal compliance and respect for fundamental rights.
- Organizational, defining technical, legal, and governance roles.
- Technological, including infrastructure, integration with existing systems, scalability, and cybersecurity.
- Strategic, betting on progressive deployments, pilots and continuous evaluation.
Beyond technology, the guide underlines the importance of ethics, transparency and public trust. All this is key and points to the idea that success does not lie in advancing quickly, but in advancing well: with a solid foundation, avoiding improvisations and ensuring that AI is applied ethically, effectively and oriented to the general interest.
Likewise, the relevance of public-private collaboration and the exchange of experiences between local entities is highlighted, as a way to reduce risks, share knowledge and optimize resources.
Real cases and conclusions
The document is completed with numerous real use cases in city councils and provincial councils, which demonstrate that AI is already a tangible reality in areas such as citizen service, social management, urban planning licenses or municipal chatbots.
In conclusion, the FEMP guide is presented as an essential manual for any local entity that wants to address AI responsibly. Its main contribution is not only to explain what AI is, but to offer a practical framework to implement it in a meaningful way, always putting citizenship, fundamental rights and good governance at the centre.
Massive, superficial AI-generated content isn't just a problem, it's also a symptom. Technology amplifies a consumption model that rewards fluidity and drains our attention span.
We listen to interviews, podcasts and audios of our family at 2x. We watch videos cut into highlights, and we base decisions and criteria on articles and reports that we have only read summarized with AI. We consume information in ultra-fast mode, but at a cognitive level we give it the same validity as when we consumed it more slowly, and we even apply it in decision-making. What is affected by this process is not the basic memory of contents, which seems to be maintained according to controlled studies, but the ability to connect that knowledge with what we already had and to elaborate our own ideas with it. More than superficiality, it is worrying that this new way of thinking is sufficient in so many contexts.
What's new and what's not?
We may think that generative AI has only intensified an old dynamic in which content production is infinite, but our attention spans are the same. We cannot fool ourselves either, because since the Internet has existed, infinity is not new. If we were to say that the problem is that there is too much content, we would be complaining about a situation in which we have been living for more than twenty years. Nor is the crisis of authority of official information or the difficulty in distinguishing reliable sources from those that are not.
However, the AI slop, which is the flood of AI-generated digital content on the Internet, brings its own logic and new considerations, such as breaking the link between effort and content, or that all that is generated is a statistical average of what already existed. This uniform and uncontrolled flow has consequences: behind the mass-generated content there may be an orchestrated intention of manipulation, an algorithmic bias, voluntary or not, that harms certain groups or slows down social advances, and also a random and unpredictable distortion of reality.
But how much of what I read is AI?
By 2025, it has been estimated that a large portion of online content incorporates synthetic text: an Ahrefs analysis of nearly one million web pages published in the first half of the year found that 74.2% of new pages contained signals of AI-generated content. Graphite research from the same year cites that, during the first year of ChatGPT alone, 39% of all online content was already generated with AI. Since November 2024, that figure has remained stable at around 52%, meaning that since then AI content outnumbers human content.
However, there are two questions we should ask ourselves when we come across estimates of this type:
1. Is there a reliable mechanism to distinguish a written text from a generated text? If the answer is no, no matter how striking and coherent the conclusions are, we cannot give them value, because they could be true or not. It is a valuable quantitative data, but one that does not yet exist.
With the information we currently have, we can say that "AI-generated text" detectors fail as often as a random model would, so we cannot attribute reliability to them. In a recent study cited by The Guardian, detectors were correct about whether the text was generated with AI or not in less than 40% of cases. On the other hand, in the first paragraph of Don Quixote, certain detectors have also returned an 86% probability that the text was created by AI.
2. What does it mean that a text is generated with AI? On the other hand, the process is not always completely automatic (what we call copying and pasting) but there are many grays in the scale: AI inspires, organizes, assists, rewrites or expands ideas, and denying, delegitimizing or penalizing this writing would be ignoring an installed reality.
The two nuances above do not cancel out the fact that the AI slop exists, but this does not have to be an inevitable fate. There are ways to mitigate its effects on our abilities.
What are the antidotes?
We may not contribute to the production of synthetic content, but we cannot slow down what is happening, so the challenge is to review the criteria and habits of mind with which we approach both reading and writing content.
1. Prioritize what clicks: one of the few reliable signals we have left is that clicking sensation at the moment when something connects with a previous knowledge, an intuition that we had diffused or an experience of our own, and reorganizes it or makes it clear. We also often say that it "resonates". If something clicks, it's worth following, confirming, researching, and briefly elaborating on a personal level.
2. Look for friction with data: anchoring content in open data and verifiable sources introduces healthy friction against the AI slop. It reduces, above all, arbitrariness and the feeling of interchangeable content, because the data force us to interpret and put it in context. It is a way of putting stones in the excessively fluid river that is the generation of language, and it works when we read and when we write.
3. Who is responsible? The text exists easily now, the question is why it exists or what it wants to achieve, and who is ultimately responsible for that goal. It seeks the signature of people or organizations, not so much for authorship but for responsibility. He is wary of collective signatures, also in translations and adaptations.
4. Change the focus of merit: evaluate your inertia when reading, because perhaps one day you learned to give merit to texts that sounded convincing, used certain structures or went up to a specific register. It shifts value to non-generatable elements such as finding a good story, knowing how to formulate a vague idea or daring to give a point of view in a controversial context.
On the other side of the coin, it is also a fact that content created with AI enters with an advantage in the flow, but with a disadvantage in credibility. This means that the real risk now is that AI can create high-value content, but people have lost the ability to concentrate on valuing it. To this we must add the installed prejudice that, if it is with AI, it is not valid content. Protecting our cognitive abilities and learning to differentiate between compressible and non-compressible content is therefore not a nostalgic gesture, but a skill that in the long run can improve the quality of public debate and the substrate of common knowledge.
Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.
Open data is a central piece of digital innovation around artificial intelligence as it allows, among other things, to train models or evaluate machine learning algorithms. But between "downloading a CSV from a portal" and accessing a dataset ready to apply machine learning techniques , there is still an abyss.
Much of that chasm has to do with metadata, i.e. how datasets are described (at what level of detail and by what standards). If metadata is limited to title, description, and license, the work of understanding and preparing data becomes more complex and tedious for the person designing the machine learning model. If, on the other hand, standards that facilitate interoperability are used, such as DCAT, the data becomes more FAIR (Findable, Accessible, Interoperable, Reusable) and, therefore, easier to reuse. However, additional metadata is needed to make the data easier to integrate into machine learning flows.
This article provides an overview of the various initiatives and standards needed to provide open data with metadata that is useful for the application of machine learning techniques.
DCAT as the backbone of open data portals
The DCAT (Data Catalog Vocabulary) vocabulary was designed by the W3C to facilitate interoperability between data catalogs published on the Web. It describes catalogs, datasets, and distributions, being the foundation on which many open data portals are built.
In Europe, DCAT is embodied in the DCAT-AP application profile, recommended by the European Commission and widely adopted to describe datasets in the public sector, for example, in Spain with DCAT-AP-ES. DCAT-AP answers questions such as:
- What datasets exist on a particular topic?
- Who publishes them, under what license and in what formats?
- Where are the download URLs or access APIs?
Using a standard like DCAT is imperative for discovering datasets, but you need to go a step further in order to understand how they are used in machine learning models or what quality they are from the perspective of these models.
MLDCAT-AP: Machine Learning in an Open Data Portal Catalog
MLDCAT-AP (Machine Learning DCAT-AP) is a DCAT application profile developed by SEMIC and the Interoperable Europe community, in collaboration with OpenML, that extends DCAT-AP to the machine learning domain.
MLDCAT-AP incorporates classes and properties to describe:
- Machine learning models and their characteristics.
- Datasets used in training and assessment.
- Quality metrics obtained on datasets.
- Publications and documentation associated with machine learning models.
- Concepts related to risk, transparency and compliance with the European regulatory context of the AI Act.
With this, a catalogue based on MLDCAT-AP no longer only responds to "what data is there", but also to:
- Which models have been trained on this dataset?
- How has that model performed by certain metrics?
- Where is this work described (scientific articles, documentation, etc.)?
MLDCAT-AP represents a breakthrough in traceability and governance, but the definition of metadata is maintained at a level that does not yet consider the internal structure of the datasets or what exactly their fields mean. To do this, it is necessary to go down to the level of the structure of the dataset distribution itself.
Metadata at the internal structure level of the dataset
When you want to describe what's inside the distributions of datasets (fields, types, constraints), an interesting initiative is Data Package, part of the Frictionless Data ecosystem.
A Data Package is defined by a JSON file that describes a set of data. This file includes not only general metadata (such as name, title, description or license) and resources (i.e. data files with their path or a URL to access their corresponding service), but also defines a schema with:
- Field names.
- Data types (integer, number, string, date, etc.).
- Constraints, such as ranges of valid values, primary and foreign keys, and so on.
From a machine learning perspective, this translates into the possibility of performing automatic structural validation before using the data. In addition, it also allows for accurate documentation of the internal structure of each dataset and easier sharing and versioning of datasets.
In short, while MLDCAT-AP indicates which datasets exist and how they fit into the realm of machine learning models, Data Package specifies exactly "what's there" within datasets.
Croissant: Metadata that prepares open data for machine learning
Even with the support of MLDCAT-AP and Data Package, it would be necessary to connect the underlying concepts in both initiatives. On the one hand, the field of machine learning (MLDCAT-AP) and on the other hand, that of the internal structures of the data itself (Data Package). In other words, the metadata of MLDCAT-AP and Data Package may be used, but in order to overcome some limitations that both suffer, it is necessary to complement it. This is where Croissant comes into play, a metadata format for preparing datasets for machine learning application. Croissant is developed within the framework of MLCommons, with the participation of industry and academia.
Specifically, Croissant is implemented in JSON-LD and built on top of schema.org/Dataset, a vocabulary for describing datasets on the Web. Croissant combines the following metadata:
- General metadata of the dataset.
- Description of resources (files, tables, etc.).
- Data structure.
- Semantic layer on machine learning (separation of training/validation/test data, target fields, etc.)
It should be noted that Croissant is designed so that different repositories (such as Kaggle, HuggingFace, etc.) can publish datasets in a format that machine learning libraries (TensorFlow, PyTorch, etc.) can load homogeneously. There is also a CKAN extension to use Croissant in open data portals.
Other complementary initiatives
It is worth briefly mentioning other interesting initiatives related to the possibility of having metadata to prepare datasets for the application of machine learning ("ML-ready datasets"):
- schema.org/Dataset: Used in web pages and repositories to describe datasets. It is the foundation on which Croissant rests and is integrated, for example, into Google's structured data guidelines to improve the localization of datasets in search engines.
- CSV on the Web (CSVW): W3C set of recommendations to accompany CSV files with JSON metadata (including data dictionaries), very aligned with the needs of tabular data documentation that is then used in machine learning.
- Datasheets for Datasets and Dataset Cards: Initiatives that enable the development of narrative and structured documentation to describe the context, provenance, and limitations of datasets. These initiatives are widely adopted on platforms such as Hugging Face.
Conclusions
There are several initiatives that help to make a suitable metadata definition for the use of machine learning with open data:
- DCAT-AP and MLDCAT-AP articulate catalog-level, machine learning models, and metrics.
- Data Package describes and validates the structure and constraints of data at the resource and field level.
- Croissant connects this metadata to the machine learning flow, describing how the datasets are concrete examples for each model.
- Initiatives such as CSVW or Dataset Cards complement the previous ones and are widely used on platforms such as HuggingFace.
These initiatives can be used in combination. In fact, if adopted together, open data is transformed from simply "downloadable files" to machine learning-ready raw material, reducing friction, improving quality, and increasing trust in AI systems built on top of it.
Jose Norberto Mazón, Professor of Computer Languages and Systems at the University of Alicante. The contents and views expressed in this publication are the sole responsibility of the author.
Three years after the acceleration of the massive deployment of Artificial Intelligence began with the launch of ChatGPT, a new term emerges strongly: Agentic AI. In the last three years, we have gone from talking about language models (such as LLMs) and chatbots (or conversational assistants) to designing the first systems capable not only of answering our questions, but also of acting autonomously to achieve objectives, combining data, tools and collaborations with other AI agents or with humans. That is, the global conversation about AI is moving from the ability to "converse" to the ability to "act" of these systems.
In the private sector, recent reports from large consulting firms describe AI agents that resolve customer incidents from start to finish, orchestrate supply chains, optimize inventories in the retail sector or automate business reporting. In the public sector, this conversation is also beginning to take shape and more and more administrations are exploring how these systems can help simplify procedures or improve citizen service. However, the deployment seems to be somewhat slower because logically the administration must not only take into account technical excellence but also strict compliance with the regulatory framework, which in Europe is set by the AI Regulation, so that autonomous agents are, above all, allies of citizens.
What is Agentic AI?
Although it is a recent concept that is still evolving, several administrations and bodies are beginning to converge on a definition. For example, the UK government describes agent AI as systems made up of AI agents that "can autonomously behave and interact to achieve their goals." In this context, an AI agent would be a specialized piece of software that can make decisions and operate cooperatively or independently to achieve the system's goals.
We might think, for example, of an AI agent in a local government who receives a request from a person to open a small business. The agent, designed in accordance with the corresponding administrative procedure, would check the applicable regulations, consult urban planning and economic activity data, verify requirements, fill in draft documents, propose appointments or complementary procedures and prepare a summary so that the civil servants could review and validate the application. That is, it would not replace the human decision, but would automate a large part of the work between the request made by the citizen and the resolution issued by the administration.
Compared to a conversational chatbot – which answers a question and, in general, ends the interaction there – an AI agent can chain multiple actions, review results, correct errors, collaborate with other AI agents and continue to iterate until it reaches the goal that has been defined for it. This does not mean that autonomous agents decide on their own without supervision, but that they can take over a good part of the task always following well-defined rules and safeguards.
Key characteristics of a freelance agent include:
- Perception and reasoning: is the ability of an agent to understand a complex request, interpret the context, and break down the problem into logical steps that lead to solving it.
- Planning and action: it is the ability to order these steps, decide the sequence in which they are going to be executed, and adapt the plan when the data changes or new constraints appear.
- Use of tools: An agent can, for example, connect to various APIs, query databases, open data catalogs, open and read documents, or send emails as required by the tasks they are trying to solve.
- Memory and context: is the ability of the agent to maintain the memory of interactions in long processes, remembering past actions and responses and the current state of the request it is resolving.
- Supervised autonomy: an agent can make decisions within previously established limits to advance towards the goal without the need for human intervention at each step, but always allowing the review and traceability of decisions.
We could summarize the change it entails with the following analogy: if LLMs are the engine of reasoning, AI agents are systems that , in addition to the ability to "think" about the actions that should be done, have "hands" to interact with the digital world and even with the physical world and execute those same actions.
The potential of AI agents in public services
Public services are organized, to a large extent, around processes of a certain complexity such as the processing of aid and subsidies, the management of files and licenses or the citizen service itself through multiple channels. They are processes with many different steps, rules and actors, where repetitive tasks and manual work of reviewing documentation abound.
As can be seen in the European Union's eGovernment Benchmark, eGovernment initiatives in recent decades have made it possible to move towards greater digitalisation of public services. However, the new wave of AI technologies, especially when foundational models are combined with agents, opens the door to a new leap to intelligently automate and orchestrate a large part of administrative processes.
In this context, autonomous agents would allow:
- Orchestrate end-to-end processes such as collecting data from different sources, proposing forms already completed, detecting inconsistencies in the documentation provided, or generating draft resolutions for validation by the responsible personnel.
- Act as "co-pilots" of public employees, preparing drafts, summaries or proposals for decisions that are then reviewed and validated, assisting in the search for relevant information or pointing out possible risks or incidents that require human attention.
- Optimise citizen service processes by supporting tasks such as managing medical appointments, answering queries about the status of files, facilitating the payment of taxes or guiding people in choosing the most appropriate procedure for their situation.
Various analyses on AI in the public sector suggest that this type of intelligent automation, as in the private sector, can reduce waiting times, improve the quality of decisions and free up staff time for more value-added tasks. A recent report by PWC and Microsoft exploring the potential of Agent AI for the public sector sums up the idea well, noting that by incorporating Agent AI into public services, governments can improve responsiveness and increase citizen satisfaction, provided that the right safeguards are in place.
In addition, the implementation of autonomous agents allows us to dream of a transition from a reactive administration (which waits for the citizen to request a service) to a proactive administration that offers to do part of those same actions for us: from notifying us that a grant has been opened for which we probably meet the requirements, to proposing the renewal of a license before it expires or reminding us of a medical appointment.
An illustrative example of the latter could be an AI agent that, based on data on available services and the information that the citizen himself has authorised to use, detects that a new aid has been published for actions to improve energy efficiency through the renovation of homes and sends a personalised notice to those who could meet the requirements. Even offering them a pre-filled draft application for review and acceptance. The final decision is still human, but the effort of seeking information, understanding conditions, and preparing documentation could be greatly reduced.
The role of open data
For an AI agent to be able to act in a useful and responsible way, they need to leverage on an environment rich in quality data and a robust data governance system. Among those assets needed to develop a good autonomous agent strategy, open data is important in at least three dimensions:
- Fuel for decision-making: AI agents need information on current regulations, service catalogues, administrative procedures, socio-economic and demographic indicators, data on transport, environment, urban planning, etc. To this end, data quality and structure is of great importance as outdated, incomplete, or poorly documented data can lead agents to make costly mistakes. In the public sector, these mistakes can translate into unfair decisions that could ultimately lead to a loss of public trust.
- Testbed for evaluating and auditing agents: Just as open data is important for evaluating generative AI models, it can also be important for testing and auditing autonomous agents. For example, simulating fictitious files with synthetic data based on real distributions to check how an agent acts in different scenarios. In this way, universities, civil society organizations and the administration itself can examine the behavior of agents and detect problems before scaling their use.
- Transparency and explainability: Open data could help document where the data an agent uses came from, how it has been transformed, or which versions of the datasets were in place when a decision was made. This traceability contributes to explainability and accountability, especially when an AI agent intervenes in decisions that affect people's rights or their access to public services. If citizens can consult, for example, the criteria and data that are applied to grant aid, confidence in the system is reinforced.
The panorama of agent AI in Spain and the rest of the world
Although the concept of agent AI is recent, there are already initiatives underway in the public sector at an international level and they are also beginning to make their way in the European and Spanish context:
- The Government Technology Agency (GovTech) of Singapore has published an Agentic AI Primer guide to guide developers and public officials on how to apply this technology, highlighting both its advantages and risks. In addition, the government is piloting the use of agents in various settings to reduce the administrative burden on social workers and support companies in complex licensing processes. All this in a controlled environment (sandbox) to test these solutions before scaling them.
- The UK government has published a specific note within its "AI Insights" documentation to explain what agent AI is and why it is relevant to government services. In addition, it has announced a tender to develop a "GOV.UK Agentic AI Companion" that will serve as an intelligent assistant for citizens from the government portal.
- The European Commission, within the framework of the Apply AI strategy and the GenAI4EU initiative, has launched calls to finance pilot projects that introduce scalable and replicable generative AI solutions in public administrations, fully integrated into their workflows. These calls seek precisely to accelerate the pace of digitalization through AI (including specialized agents) to improve decision-making, simplify procedures and make administration more accessible.
In Spain, although the label "agéntica AI" is not yet widely used, some experiences that go in that direction can already be identified. For example, different administrations are incorporating co-pilots based on generative AI to support public employees in tasks of searching for information, writing and summarizing documents, or managing files, as shown by initiatives of regional governments such as that of Aragon and local entities such as Barcelona City Council that are beginning to document themselves publicly.
The leap towards more autonomous agents in the public sector therefore seems to be a natural evolution on the basis of the existing e-government. But this evolution must, at the same time, reinforce the commitment to transparency, fairness, accountability, human oversight and regulatory compliance required by the AI Regulation and the rest of the regulatory framework and which should guide the actions of the public administration.
Looking to the Future: AI Agents, Open Data, and Citizen Trust
The arrival of agent AI once again offers the public administration new tools to reduce bureaucracy, personalize care and optimize its always scarce resources. However, technology is only a means, the ultimate goal is still to generate public value by reinforcing the trust of citizens.
In principle, Spain is in a good position: it has an Artificial Intelligence Strategy 2024 that is committed to transparent, ethical and human-centred AI, with specific lines to promote its use in the public sector; it has aconsolidated open data infrastructure; and it has created the Spanish Agency for the Supervision of Artificial Intelligence (AESIA) as a body in charge of ensuring an ethical and safe use of AI, in accordance with the European AI Regulation.
We are, therefore, facing a new opportunity for modernisation that can build more efficient, closer and even proactive public services. If we are able to adopt the Agent AI properly, the agents that are deployed will not be a "black box" that acts without supervision, but digital, transparent and auditable "public agents", designed to work with open data, explain their decisions and leave a trace of the actions they take. Tools, in short, inclusive, people-centred and aligned with the values of public service.
Content created by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalisation. The contents and views expressed in this publication are the sole responsibility of the author.