Blog

Open data is a central piece of digital innovation around artificial intelligence as it allows, among other things, to train models or evaluate machine learning algorithms. But between "downloading a CSV from a portal" and accessing a dataset ready to apply machine learning techniques , there is still an abyss.

Much of that chasm has to do with metadata, i.e. how datasets are described (at what level of detail and by what standards). If metadata is limited to title, description, and license, the work of understanding and preparing data becomes more complex and tedious for the person designing the machine learning model. If, on the other hand, standards that facilitate interoperability are used, such as DCAT, the data becomes more FAIR (Findable, Accessible, Interoperable, Reusable) and, therefore, easier to reuse. However, additional metadata is needed  to make the data easier to integrate into machine learning flows.

This article provides an overview of the various initiatives and standards needed to provide open data with metadata that is useful for the application of machine learning techniques.

DCAT as the backbone of open data portals

The DCAT (Data Catalog Vocabulary) vocabulary was designed by the W3C to facilitate interoperability between data catalogs published on the Web. It describes catalogs, datasets, and distributions, being the foundation on which many open data portals are built.

In Europe, DCAT is embodied in the DCAT-AP application profile, recommended by the European Commission and widely adopted to describe datasets in the public sector, for example, in Spain with DCAT-AP-ES. DCAT-AP answers questions such as:

  • What datasets exist on a particular topic?
  • Who publishes them, under what license and in what formats?
  • Where are the download URLs or access APIs?

Using a standard like DCAT is imperative for discovering datasets, but you need to go a step further in order to understand how they are used in machine learning models or what quality they are from the perspective of these models.

MLDCAT-AP: Machine Learning in an Open Data Portal Catalog

MLDCAT-AP (Machine Learning DCAT-AP) is a DCAT application profile developed by SEMIC and the Interoperable Europe community, in collaboration with OpenML, that extends DCAT-AP to the machine learning domain.

MLDCAT-AP incorporates classes and properties to describe:

  • Machine learning models and their characteristics.
  • Datasets used in training and assessment.
  • Quality metrics obtained on datasets.
  • Publications and documentation associated with machine learning models.
  • Concepts related to risk, transparency and compliance with the European regulatory context of the AI Act.

With this, a catalogue based on MLDCAT-AP no longer only responds to "what data is there", but also to:

  • Which models have been trained on this dataset?
  • How has that model performed by certain metrics?
  • Where is this work described (scientific articles, documentation, etc.)?

MLDCAT-AP represents a breakthrough in traceability and governance, but the definition of metadata is maintained at a level that does not yet consider the internal structure of the datasets or what exactly their fields mean. To do this, it is necessary to go down to the level of the structure of the dataset distribution itself.

Metadata at the internal structure level of the dataset

When you want to describe what's inside the distributions of datasets (fields, types, constraints), an interesting initiative is Data Package, part of the Frictionless Data ecosystem.

A Data Package is defined by a JSON file that describes a set of data. This file includes not only general metadata (such as name, title, description or license) and resources (i.e. data files with their path or a URL to access their corresponding service), but also defines a schema with:

  • Field names.
  • Data types (integer, number, string, date, etc.).
  • Constraints, such as ranges of valid values, primary and foreign keys, and so on.

From a machine learning perspective, this translates into the possibility of performing automatic structural validation before using the data. In addition, it also allows for accurate documentation of the internal structure of each dataset and easier sharing and versioning of datasets.

In short, while MLDCAT-AP indicates which datasets exist and how they fit into the realm of machine learning models, Data Package specifies exactly "what's there" within datasets.

Croissant: Metadata that prepares open data for machine learning

Even with the support of MLDCAT-AP and Data Package, it would be necessary to connect the underlying concepts in both initiatives. On the one hand, the field of machine learning (MLDCAT-AP) and on the other hand, that of the internal structures of the data itself (Data Package). In other words, the metadata of MLDCAT-AP and Data Package may be used, but in order to overcome some limitations that both suffer, it is necessary to complement it. This is where Croissant comes into play, a metadata format for preparing datasets for machine learning application. Croissant is developed within the framework of MLCommons, with the participation of industry and academia.

Specifically, Croissant is implemented in JSON-LD and built on top of schema.org/Dataset, a vocabulary for describing datasets on the Web. Croissant combines the following metadata:

  • General metadata of the dataset.
  • Description of resources (files, tables, etc.).
  • Data structure.
  • Semantic layer on machine learning (separation of training/validation/test data, target fields, etc.)

It should be noted that Croissant is designed so that different repositories (such as Kaggle, HuggingFace, etc.) can publish datasets in a format that machine learning libraries (TensorFlow, PyTorch, etc.) can load homogeneously. There is also a CKAN extension to use Croissant in open data portals.

Other complementary initiatives

It is worth briefly mentioning other interesting initiatives related to the possibility of having metadata to prepare datasets for the application of machine learning ("ML-ready datasets"):

  • schema.org/Dataset: Used in web pages and repositories to describe datasets. It is the foundation on which Croissant rests and is integrated, for example, into Google's structured data guidelines to improve the localization of datasets in search engines.
  • CSV on the Web (CSVW): W3C set of recommendations to accompany CSV files with JSON metadata (including data dictionaries), very aligned with the needs of tabular data documentation that is then used in machine learning.
  • Datasheets for Datasets and Dataset Cards: Initiatives that enable the development  of narrative and structured documentation to describe the context, provenance, and limitations of datasets. These initiatives are widely adopted on platforms such as Hugging Face.

Conclusions

There are several initiatives that help to make a suitable metadata definition for the use of machine learning with open data:

  • DCAT-AP and MLDCAT-AP articulate catalog-level, machine learning models, and metrics.
  • Data Package describes and validates the structure and constraints of data at the resource and field level.
  • Croissant connects this metadata to the machine learning flow, describing how the datasets are concrete examples for each model.
  • Initiatives such as CSVW or Dataset Cards complement the previous ones and are widely used on platforms such as HuggingFace.

These initiatives can be used in combination. In fact, if adopted together, open data is  transformed from simply "downloadable files" to machine learning-ready raw material, reducing friction, improving quality, and increasing trust in AI systems built on top of it.

Jose Norberto Mazón, Professor of Computer Languages and Systems at the University of Alicante. The contents and views expressed in this publication are the sole responsibility of the author.

calendar icon
Blog

Three years after the acceleration of the massive deployment of Artificial Intelligence began with the launch of ChatGPT, a new term emerges strongly: Agentic AI. In the last three years, we have gone from talking about language models (such as LLMs) and chatbots (or conversational assistants) to designing the first systems capable not only of answering our questions, but also of acting autonomously to achieve objectives, combining data, tools and collaborations with other AI agents or with humans. That is, the global conversation about AI is moving from the ability to "converse" to the ability to "act" of these systems.

In the private sector, recent reports from large consulting firms describe AI agents that resolve customer incidents from start to finish, orchestrate supply chains, optimize inventories in the retail sector  or automate business reporting. In the public sector, this conversation is also beginning to take shape and more and more administrations are exploring how these systems can help simplify procedures or improve citizen service. However, the deployment seems to be somewhat slower because logically the administration must not only take into account technical excellence but also strict compliance with the regulatory framework, which in Europe is set by the AI Regulation, so that autonomous agents are, above all, allies of citizens.

What is Agentic AI?

Although it is a recent concept that is still evolving, several administrations and bodies are beginning to converge on a definition. For example, the UK government describes agent AI as systems made up of AI agents that "can autonomously behave and interact to achieve their goals." In this context, an AI agent would be a specialized piece of software that can make decisions and operate cooperatively or independently to achieve the system's goals.

We might think, for example, of an AI agent in a local government who receives a request from a person to open a small business. The agent, designed in accordance with the corresponding administrative procedure, would check the applicable regulations, consult urban planning and economic activity data, verify requirements, fill in draft documents, propose appointments or complementary procedures and prepare a summary so that the civil servants could review and validate the application. That is, it would not replace the human decision, but would automate a large part of the work between the request made by the citizen and the resolution issued by the administration.

Compared to a conversational chatbot – which answers a question and, in general, ends the interaction there – an AI agent can chain multiple actions, review results, correct errors, collaborate with other AI agents and continue to iterate until it reaches the goal that has been defined for it. This does not mean that autonomous agents decide on their own without supervision, but that they can take over a good part of the task always following well-defined rules and safeguards.

Key characteristics of a freelance agent include:

  • Perception and reasoning: is the ability of an agent to understand a complex request, interpret the context, and break down the problem into logical steps that lead to solving it.
  • Planning and action: it is the ability to order these steps, decide the sequence in which they are going to be executed, and adapt the plan when the data changes or new constraints appear.
  • Use of tools: An agent can, for example, connect to various APIs, query databases, open data catalogs, open and read documents, or send emails as required by the tasks they are trying to solve.
  • Memory and context: is the ability of the agent to maintain the memory of interactions in long processes, remembering past actions and responses and the current state of the request it is resolving.
  • Supervised autonomy: an agent can make decisions within previously established limits to advance towards the goal without the need for human intervention at each step, but always allowing the review and traceability of decisions.

We could summarize the change it entails with the following analogy: if LLMs are the engine of reasoning, AI agents are systems that , in addition to the ability to "think" about the actions that should be done, have "hands" to interact with the digital world and even with the physical world and execute those same actions.

The potential of AI agents in public services

Public services are organized, to a large extent, around processes of a certain complexity such as the processing of aid and subsidies, the management of files and licenses or the citizen service itself through multiple channels. They are processes with many different steps, rules and actors, where repetitive tasks and manual work of reviewing documentation abound.

As can be seen in the  European Union's eGovernment Benchmark, eGovernment initiatives in recent decades have made it possible to move towards greater digitalisation of public services. However, the new wave of AI technologies, especially when foundational models are combined with agents, opens the door to a new leap to intelligently automate and orchestrate a large part of administrative processes.

In this context, autonomous agents would allow:

  • Orchestrate end-to-end processes such as collecting data from different sources, proposing forms already completed, detecting inconsistencies in the documentation provided, or generating draft resolutions for validation by the responsible personnel.
  • Act as "co-pilots" of public employees, preparing drafts, summaries or proposals for decisions that are then reviewed and validated, assisting in the search for relevant information or pointing out possible risks or incidents that require human attention.
  • Optimise citizen service processes by  supporting tasks such as managing medical appointments, answering queries about the status of files, facilitating the payment of taxes or guiding people in choosing the most appropriate procedure for their situation.

Various analyses on AI in the public sector suggest that this type of intelligent automation, as in the private sector, can reduce waiting times, improve the quality of decisions and free up staff time for more value-added tasks. A recent report by PWC and Microsoft exploring the potential of Agent AI for the public sector sums up the idea well, noting that by incorporating Agent AI into public services, governments can improve responsiveness and increase citizen satisfaction, provided that the right safeguards are in place.

In addition, the implementation of autonomous agents allows us to dream of a transition from a reactive administration (which waits for the citizen to request a service) to a proactive administration that offers to do part of those same actions for us: from notifying us that a grant has been opened for which we probably meet the requirements,  to proposing the renewal of a license before it expires or reminding us of a medical appointment.

An illustrative example of the latter could be an AI agent that, based on data on available services and the information that the citizen himself has authorised to use, detects that a new aid has been published for actions to improve energy efficiency through the renovation of homes and sends a personalised notice to those who could meet the requirements. Even offering them a pre-filled draft application for review and acceptance. The final decision is still human, but the effort of seeking information, understanding conditions, and preparing documentation could be greatly reduced.

The role of open data

For an AI agent to be able to act in a useful and responsible way, they need to leverage on an environment rich in quality data and a robust data governance system. Among those assets needed to develop a good autonomous agent strategy, open data is important in at least three dimensions:

  1. Fuel for decision-making: AI agents need information on current regulations, service catalogues, administrative procedures, socio-economic and demographic indicators, data on transport, environment, urban planning, etc. To this end, data quality and structure is of great importance as outdated, incomplete, or poorly documented data can lead agents to make costly mistakes. In the public sector, these mistakes can translate into unfair decisions that could ultimately lead to a loss of public trust.
  2. Testbed for evaluating and auditing agents: Just as open data is important for evaluating generative AI models, it can also be important for testing and auditing autonomous agents. For example, simulating fictitious files with synthetic data based on real distributions to check how an agent acts in different scenarios. In this way, universities, civil society organizations and the administration itself can examine the behavior of agents and detect problems before scaling their use.
  3. Transparency and explainability: Open data could help document where the data an agent uses came from, how it has been transformed, or which versions of the datasets were in place when a decision was made. This traceability contributes to explainability and accountability, especially when an AI agent intervenes in decisions that affect people's rights or their access to public services. If citizens can consult, for example, the criteria and data that are applied to grant aid, confidence in the system is reinforced.

The panorama of agent AI in Spain and the rest of the world

Although the concept of agent AI is recent, there are already initiatives underway in the public sector at an international level and they are also beginning to make their way in the European and Spanish context:

  • The Government Technology Agency (GovTech) of Singapore has published an Agentic AI Primer guide  to guide developers and public officials on how to apply this technology, highlighting both its advantages and risks. In addition, the government is piloting the use of agents in various settings to reduce the administrative burden on social workers and support companies in complex licensing processes. All this in a controlled environment (sandbox) to test these solutions before scaling them.
  • The UK government has published a specific note within its "AI Insights" documentation to explain what agent AI is and why it is relevant to government services. In addition, it has announced a tender to develop a "GOV.UK Agentic AI Companionthat will serve as an intelligent assistant for citizens from the government portal.
  • The European Commission, within the framework of the Apply AI strategy and the GenAI4EU initiative, has launched calls to finance pilot projects that introduce scalable and replicable generative AI solutions in public administrations, fully integrated into their workflows. These calls seek precisely to accelerate the pace of digitalization through AI (including specialized agents) to improve decision-making, simplify procedures and make administration more accessible.

In Spain, although the label "agéntica AI" is not yet widely used, some experiences that go in that direction can already be identified. For example, different administrations are incorporating co-pilots based on generative AI to support public employees in tasks of searching for information, writing and summarizing documents, or managing files, as shown by initiatives of regional governments such as that of Aragon and local entities such as Barcelona City Council that are beginning to document themselves publicly.

The leap towards more autonomous agents in the public sector therefore seems to be a natural evolution on the basis of the existing e-government. But this evolution must, at the same time, reinforce the commitment to transparency, fairness, accountability, human oversight and regulatory compliance required by the AI Regulation and the rest of the regulatory framework and which should guide the actions of the public administration.

Looking to the Future: AI Agents, Open Data, and Citizen Trust

The arrival of agent AI once again offers the public administration new tools to reduce bureaucracy, personalize care and optimize its always scarce resources. However, technology is only a means, the ultimate goal is still  to generate public value by reinforcing the trust of citizens.

In principle, Spain is in a good position: it has an Artificial Intelligence Strategy 2024 that is committed to transparent, ethical and human-centred AI, with specific lines to promote its use in the public sector; it has aconsolidated open data infrastructure; and it has created the Spanish Agency for the Supervision of Artificial Intelligence (AESIA) as a body in charge of ensuring an ethical and safe use of AI, in accordance with the European AI Regulation.

We are, therefore, facing a new opportunity for modernisation that can build more efficient, closer and even proactive public services. If we are able to adopt the Agent AI properly, the agents that are deployed will not be a "black box" that acts without supervision, but digital, transparent and auditable "public agents", designed to work with open data, explain their decisions and leave a trace of the actions they takeTools, in short, inclusive, people-centred and aligned with the values of public service.

Content created by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalisation. The contents and views expressed in this publication are the sole responsibility of the author.

calendar icon
Noticia

 On 19 November, the European Commission presented the Data Union Strategy, a roadmap that seeks to consolidate a robust, secure and competitive European data ecosystem. This strategy is built around three key pillars: expanding access to quality data for artificial intelligence and innovation, simplifying the existing regulatory framework, and protecting European digital sovereignty. In this post, we will explain each of these pillars in detail, as well as the implementation timeline of the plan planned for the next two years.

Pillar 1: Expanding access to quality data for AI and innovation

The first pillar of the strategy focuses on ensuring that companies, researchers and public administrations have access to high-quality data that allows the development of innovative applications, especially in the field of artificial intelligence. To this end, the Commission proposes a number of interconnected initiatives ranging from the creation of infrastructure to the development of standards and technical enablers. A series of actions are established as part of this pillar: the expansion of common European data spaces, the development of data labs, the promotion of the Cloud and AI Development Act, the expansion of strategic data assets and the development of facilitators to implement these measures.

1.1 Extension of the Common European Data Spaces (ECSs)

Common European Data Spaces are one of the central elements of this strategy:

  • Planned investment: 100 million euros for its deployment.

  • Priority sectors: health, mobility, energy, (legal) public administration and environment.

  • Interoperability: SIMPL is committed  to interoperability between data spaces with the support of the European Data Spaces Support Center (DSSC).

  • Key Applications:

    • European Health Data Space (EHDS): Special mention for its role as a bridge between health data systems and the development of AI.

    • New Defence Data Space: for the development of state-of-the-art systems, coordinated by the European Defence Agency.

1.2 Data Labs: the new ecosystem for connecting data and AI development

The strategy proposes to use Data Labs as points of connection between the development of artificial intelligence and European data.

These labs employ data pooling, a process of combining and sharing public and restricted data from multiple sources in a centralized repository or shared environment. All this facilitates access and use of information. Specifically, the services offered by Data Labs are:

  • Makes it easy to access data.

  • Technical infrastructure and tools.

  • Data pooling.

  • Data filtering and labeling 

  • Regulatory guidance and training.

  • Bridging the gap between data spaces and AI ecosystems.

Implementation plan:

  • First phase: the first Data Labs will be established within the framework of AI Factories (AI gigafactories), offering data services to connect AI development with European data spaces.

  • Sectoral Data Labs: will be established independently in other areas to cover specific needs, for example, in the energy sector.

  • Self-sustaining model: It is envisaged that the Data Labs model  can be deployed commercially, making it a self-sustaining ecosystem that connects data and AI.

1.3 Cloud and AI Development Act: boosting the sovereign cloud

To promote cloud technology, the Commission will propose this new regulation in the first quarter of 2026. There is currently an open public consultation in which you can participate here.

1.4 Strategic data assets: public sector, scientific, cultural and linguistic resources

On the one hand, in 2026 it will be proposed to expand the list of high-value data  in English or HVDS to include legal, judicial and administrative data, among others. And on the other hand, the Commission will map existing bases and finance new digital infrastructure.

1.5 Horizontal enablers: synthetic data, data pooling, and standards

The European Commission will develop guidelines and standards on synthetic data and advanced R+D in techniques for its generation will be funded through Horizon Europe.

Another issue that the EU wants to promote is data pooling, as we explained above. Sharing data from early stages of the production cycle can generate collective benefits, but barriers persist due to legal uncertainty and fear of violating competition rules. Its purpose? Make data pooling a reliable and legally secure option to accelerate progress in critical sectors.

Finally, in terms of standardisation, the European standardisation organisations (CEN/CENELEC) will be asked to develop new technical standards in two key areas: data quality and labelling. These standards will make it possible to establish common criteria on how data should be to ensure its reliability and how it should be labelled to facilitate its identification and use in different contexts.

Pillar 2: Regulatory simplification

The second pillar addresses one of the challenges most highlighted by companies and organisations: the complexity of the European regulatory framework on data. The strategy proposes a series of measures aimed at simplifying and consolidating existing legislation.

2.1 Derogations and regulatory consolidation: towards a more coherent framework

The aim is to eliminate regulations whose functions are already covered by more recent legislation, thus avoiding duplication and contradictions. Firstly, the Free Flow of Non-Personal Data Regulation (FFoNPD) will be repealed, as its functions are now covered by the Data Act. However, the prohibition of unjustified data localisation, a fundamental principle for the Digital Single Market, will be explicitly preserved.

Similarly, the Data Governance Act  (European Data Governance Regulation or DGA) will be eliminated as a stand-alone rule, migrating its essential provisions to the Data Act. This move simplifies the regulatory framework and also eases the administrative burden: obligations for data intermediaries will become lighter and more voluntary.

As for the public sector, the strategy proposes an important consolidation. The rules on public data sharing, currently dispersed between the DGA and the Open Data Directive, will be merged into a single chapter within the Data Act. This unification will facilitate both the application and the understanding of the legal framework by public administrations.

2.2 Cookie reform: balancing protection and usability

Another relevant detail is the regulation of cookies, which will undergo a significant modernization, being integrated into the framework of the General Data Protection Regulation (GDPR). The reform seeks a balance: on the one hand, low-risk uses that currently generate legal uncertainty will be legalized; on the other,  consent banners will be simplified  through "one-click" systems. The goal is clear: to reduce the so-called "user fatigue" in the face of the repetitive requests for consent that we all know when browsing the Internet.

2.3 Adjustments to the GDPR to facilitate AI development

The General Data Protection Regulation will also be subject to a targeted reform, specifically designed to release data responsibly for the benefit of the development of artificial intelligence. This surgical intervention addresses three specific aspects:

  1. It clarifies when legitimate interest for AI model training may apply.

  2. It defines more precisely the distinction between anonymised and pseudonymised data, especially in relation to the risk of re-identification.

  3. It harmonises data protection impact assessments, facilitating their consistent application across the Union.

2. 4 Implementation and Support for the Data Act

The recently approved Data Act will be subject to adjustments to improve its application. On the one hand, the scope of business-to-government ( B2G) data sharing is refined, strictly limiting it to emergency situations. On the other hand, the umbrella of protection is extended: the favourable conditions currently enjoyed by small and medium-sized enterprises (SMEs) will also be extended to medium-sized companies or small mid-caps, those with between 250 and 749 employees.

To facilitate the practical implementation of the standard, a model contractual clause for data exchange has already been published , thus providing a template that organizations can use directly. In addition, two additional guides will be published during the first quarter of 2026: one on the concept of "reasonable compensation" in data exchanges, and another aimed at clarifying the key definitions of the Data Act that may generate interpretative doubts.

Aware that SMEs may struggle to navigate this new legal framework, a Legal Helpdesk  will be set up in the fourth quarter of 2025. This helpdesk will provide direct advice on the implementation of the Data Act, giving priority precisely to small and medium-sized enterprises that lack specialised legal departments.

2.5 Evolving governance: towards a more coordinated ecosystem

The governance architecture of the European data ecosystem is also undergoing significant changes. The European Data Innovation Board (EDIB) evolves from a primarily advisory body to a forum for more technical and strategic discussions, bringing together both Member States and industry representatives. To this end, its articles will be modified with two objectives: to allow the inclusion of the competent authorities in the debates on Data Act, and to provide greater flexibility to the European Commission in the composition and operation of the body.

In addition, two additional mechanisms of feedback and anticipation are articulated. The Apply AI Alliance will channel  sectoral feedback, collecting the specific experiences and needs of each industry. For its part, the AI Observatory will act as a trend radar, identifying emerging developments in the field of artificial intelligence and translating them into public policy recommendations. In this way, a virtuous circle is closed where politics is constantly nourished by the reality of the field.

Pillar 3: Protecting European data sovereignty

The third pillar focuses on ensuring that European data is treated fairly and securely, both inside and outside the Union's borders. The intention is that data will only be shared with countries with the same regulatory vision.

3.1 Specific measures to protect European data

  • Publication of guides to assess the fair treatment of EU data abroad (Q2 2026):

  • Publication of the Unfair Practices Toolbox  (Q2 2026):

    • Unjustified location.

    • Exclusion.

    • Weak safeguards.

    • The data leak.

  • Taking measures to protect sensitive non-personal data.

All these measures are planned to be implemented from the last quarter of 2025 and throughout 2026 in a progressive deployment that will allow a gradual and coordinated adoption of the different measures, as established in the Data Union Strategy.

In short, the Data Union Strategy represents a comprehensive effort to consolidate European leadership in the data economy. To this end, data pooling and data spaces in the Member States will  be promoted, Data Labs and AI gigafactories will be committed to and regulatory simplification will be encouraged.

calendar icon
Blog

The convergence between open data, artificial intelligence and environmental sustainability poses one of the main challenges for the digital transformation model that is being promoted at European level. This interaction is mainly materialized in three outstanding manifestations:

  • The opening of high-value data directly related to sustainability, which can help the development of artificial intelligence solutions aimed at climate change mitigation and resource efficiency.

  • The promotion of the so-called green algorithms in the reduction of the environmental impact of AI, which must be materialized both in the efficient use of digital infrastructure and in sustainable decision-making.

  • The commitment to environmental data spaces, generating digital ecosystems where data from different sources is shared to facilitate the development of interoperable projects and solutions with a relevant impact from an environmental perspective.

Below, we will delve into each of these points.

High-value data for sustainability

 Directive (EU) 2019/1024 on open data and re-use of public sector information introduced for the first time the concept of high-value datasets, defined as those with exceptional potential to generate social, economic and environmental benefits. These sets should be published free of charge, in machine-readable formats, using application programming interfaces (APIs) and, where appropriate, be available for bulk download. A number of priority categories have been identified for this purpose, including environmental and Earth observation data.

This is a particularly relevant category, as it covers both data on climate, ecosystems or environmental quality, as well as those linked to the INSPIRE Directive, which refer to certainly diverse areas such as hydrography, protected sites, energy resources, land use, mineral resources or, among others, those related to areas of natural hazards, including orthoimages.

These data are particularly relevant when it comes to monitoring variables related to climate change, such as land use, biodiversity management taking into account the distribution of species, habitats and protected sites, monitoring of invasive species or the assessment of natural risks. Data on air quality and pollution are crucial for public and environmental health, so access to them allows  exhaustive analyses to be carried out,  which are undoubtedly relevant for the adoption of public policies aimed at improving them. The management of water resources can also be optimized through hydrography data and environmental monitoring, so that its massive and automated treatment is an inexcusable premise to face the challenge of the digitalization of water cycle management.

Combining it with other quality environmental data facilitates the development of AI solutions geared towards specific climate challenges. Specifically, they allow predictive models to be trained to anticipate extreme phenomena (heat waves, droughts, floods), optimize the management of natural resources or monitor critical environmental indicators in real time. It also makes it possible to promote high-impact economic projects, such as the use of AI algorithms to implement technological solutions in the field of precision agriculture, enabling the intelligent adjustment of irrigation systems, the early detection of pests or the optimization of the use of fertilizers.

Green algorithms and digital responsibility: towards sustainable AI

Training and deploying AI systems, particularly general-purpose models and large language models, involves significant energy consumption. According to estimates by the International Energy Agency, data centers accounted for around 1.5% of global electricity consumption in 2024. This represents a growth of around 12% per year since 2017, more than four times faster than the rate of total electricity consumption. Data center power consumption is expected to double to around 945 TWh by 2030.

Against this backdrop, green algorithms are an alternative that must necessarily be taken into account when it comes to minimising the environmental impact posed by the implementation of digital technology and, specifically, AI. In fact, both the European Data Strategy and the European Green Deal explicitly integrate digital sustainability as a strategic pillar. For its part, Spain has launched a National Green Algorithm Programme, framed in the 2026 Digital Agenda and with a specific measure in the National Artificial Intelligence Strategy.

One of the main objectives of the Programme is to promote the development of algorithms that minimise their environmental impact from conception ( green by design), so the requirement of exhaustive documentation of the datasets used to train AI models – including origin, processing, conditions of use and environmental footprint – is essential to fulfil this aspiration. In this regard, the Commission has published a template to help general-purpose AI providers summarise the data used for the training of their models, so that greater transparency can be demanded, which, for the purposes of the present case, would also facilitate traceability and responsible governance from an environmental perspective.  as well as the performance of eco-audits.

The European Green Deal Data Space

It is one of the common European data spaces contemplated in the European Data Strategy that is at a more advanced stage, as demonstrated by the numerous initiatives and dissemination events that have been promoted around it. Traditionally, access to environmental information has been one of the areas with the most favourable regulation, so that with the promotion of high-value data and the firm commitment to the creation of a European area in this area, there has been a very remarkable qualitative advance that reinforces an already consolidated trend in this area.

Specifically, the data spaces model facilitates interoperability between public and private open data, reducing barriers to entry for startups and SMEs in sectors such as smart forest managementprecision agriculture or, among many other examples, energy optimization. At the same time, it reinforces the quality of the data available for Public Administrations to carry out their public policies, since their own sources can be contrasted and compared with other data sets. Finally, shared access to data and AI tools can foster collaborative innovation initiatives and projects, accelerating the development of interoperable and scalable solutions.

However, the legal ecosystem of data spaces entails a complexity inherent in its own institutional configuration, since it brings together several subjects and, therefore, various interests and applicable legal regimes: 

  • On the one hand, public entities, which have a particularly reinforced leadership role in this area.

  • On the other hand, private entities and citizens, who can not only contribute their own datasets, but also offer digital developments and tools that value data through innovative services.

  • And, finally, the providers of the infrastructure necessary for interaction within the space.

Consequently,  advanced governance models are essential to deal with this complexity, reinforced by technological innovation and especially AI, since the traditional approaches of legislation regulating access to environmental information are certainly limited for this purpose.

Towards strategic convergence

The convergence of high-value open data, responsible green algorithms and environmental data spaces is shaping a new digital paradigm that is essential to address climate and ecological challenges in Europe that requires a robust and, at the same time, flexible legal approach. This unique ecosystem not only allows innovation and efficiency to be promoted in key sectors such as precision agriculture or energy management, but also reinforces the transparency and quality of the environmental information available for the formulation of more effective public policies.

Beyond the current regulatory framework, it is essential to design governance models that help to interpret and apply diverse legal regimes in a coherent manner, that protect data sovereignty and, ultimately, guarantee transparency and responsibility in the access and reuse of environmental information. From the perspective of sustainable public procurement, it is essential to promote procurement processes by public entities that prioritise technological solutions and interoperable services based on open data and green algorithms, encouraging the choice of suppliers committed to environmental responsibility and transparency in the carbon footprints of their digital products and services.

Only on the basis of this approach can we aspire to make digital innovation technologically advanced and environmentally sustainable, thus aligning the objectives of the Green Deal, the European Data Strategy and the European approach to AI.

Content prepared by Julián Valero, professor at the University of Murcia and coordinator of the Innovation, Law and Technology Research Group (iDerTec). The content and views expressed in this publication are the sole responsibility of the author.

 

calendar icon
Noticia

Did you know that less than two out of ten European companies use artificial intelligence (AI) in their operations? This data, corresponding to 2024, reveals the margin for improvement in the adoption of this technology. To reverse this situation and take advantage of the transformative potential of AI, the European Union has designed a comprehensive strategic framework that combines investment in computing infrastructure, access to quality data and specific measures for key sectors such as health, mobility or energy.

In this article we explain the main European strategies in this area, with a special focus on the Apply AI Strategy or the AI Continent Action Plan , adopted this year in October and April respectively. In addition, we will tell you how these initiatives complement other European strategies to create a comprehensive innovation ecosystem.

Context: Action plan and strategic sectors

On the one hand, the AI Continent Action Plan establishes five strategic pillars:

  1. Computing infrastructures: scaling computing capacity through AI Factories, AI Gigafactories and the Cloud and AI Act, specifically:
    • AI factories: infrastructures to train and improve artificial intelligence models will be promoted. This strategic axis has a budget of 10,000 million euros and is expected to lead to at least 13 AI factories by 2026.
    • Gigafactorie AI: the infrastructures needed to train and develop complex AI models will also be taken into account, quadrupling the capacity of AI factories. In this case, 20,000 million euros are invested for the development of 5 gigafactories.
    • Cloud and AI Act: Work is being done on a regulatory framework to boost research into highly sustainable infrastructure, encourage investments and triple the capacity of EU data centres over the next five to seven years.
  2. Access to quality data: facilitate access to robust and well-organized datasets through the so-called Data Labs in AI Factories.
  3. Talent and skills: strengthening AI skills across the population, specifically:
    • Create international collaboration agreements.
    • To offer scholarships in AI for the best students, researchers and professionals in the sector.
    • Promote skills in these technologies through a specific academy.
    • Test a specific degree in generative AI.
    • Support training updating through the European Digital Innovation Hub.
  4. Development and adoption of algorithms: promoting the use of artificial intelligence in strategic sectors.
  5. Regulatory framework: Facilitate compliance with the AI Regulation in a simple and innovative way and provide free and adaptable tools for companies.

On the other hand, the recently presented, in October 2025, Apply AI Strategy seeks to boost the competitiveness of strategic sectors and strengthen the EU's technological sovereignty, driving AI adoption and innovation across Europe, particularly among small and medium-sized enterprises. How? The strategy promotes an "AI first" policy, which encourages organizations to consider artificial intelligence as a potential solution whenever they make strategic or policy decisions, carefully evaluating both the benefits and risks of the technology. In addition, it encourages a European procurement approach, i.e. organisations, particularly public administrations, prioritise solutions developed in Europe. Moreover, special importance is given to open source AI solutions, because they offer greater transparency and adaptability, less dependence on external providers and are aligned with the European values of openness and shared innovation.

The Apply AI Strategy is structured in three main sections:

Flagship sectoral initiatives

The strategy identifies 11 priority areas where AI can have the greatest impact and where Europe has competitive strengths:

  • Healthcare and pharmaceuticals: AI-powered advanced European screening centres will be established to accelerate the introduction of innovative prevention and diagnostic tools, with a particular focus on cardiovascular diseases and cancer.
  • Robotics: Adoption will be driven for the adoption of European robotics connecting developers and user industries, driving AI-powered robotics solutions.
  • Manufacturing, engineering and construction: the development of cutting-edge AI models adapted to industry will be supported, facilitating the creation of digital twins and optimisation of production processes.
  • Defence, security and space: the development of AI-enabled European situational awareness and control capabilities will be accelerated, as well as highly secure computing infrastructure for defence and space AI models.
  • Mobility, transport and automotive: the "Autonomous Drive Ambition Cities" initiative will be launched to accelerate the deployment of autonomous vehicles in European cities.
  • Electronic communications: a European AI platform for telecommunications will be created that will allow operators, suppliers and user industries to collaborate on the development of open source technological elements.
  • Energy: the development of AI models will be supported to improve the forecasting, optimization and balance of the energy system.
  • Climate and environment: An open-source AI model of the Earth system and related applications will be deployed to enable better weather forecasting, Earth monitoring, and what-if scenarios.
  • Agri-food: the creation of an agri-food AI platform will be promoted to facilitate the adoption of agricultural tools enabled by this technology.
  • Cultural and creative sectors, and media: the development of micro-studios specialising in AI-enhanced virtual production and pan-European platforms using multilingual AI technologies will be incentivised.
  • Public sector: A dedicated AI toolkit for public administrations will be built with a shared repository of good practices, open source and reusable, and the adoption of scalable generative AI solutions will be accelerated.

Cross-cutting support measures

For the adoption of artificial intelligence to be effective, the strategy addresses challenges common to all sectors, specifically:

  • Opportunities for European SMEs: The more than 250 European Digital Innovation Hubs have been transformed into AI Centres of Expertise. These centres act as privileged access points to the European AI innovation ecosystem, connecting companies with AI Factories, data labs and testing facilities.
  • AI-ready workforce: Access to practical AI literacy training, tailored to sectors and professional profiles, will be provided through the AI Skills Academy.
  • Supporting the development of advanced AI: The Frontier AI Initiative seeks to accelerate progress on cutting-edge AI capabilities in Europe. Through this project, competitions will be created to develop advanced open-source artificial intelligence models, which will be available to public administrations, the scientific community and the European business sector.
  • Trust in the European market: Disclosure will be strengthened to ensure compliance with the European Union's AI Regulation, providing guidance on the classification of high-risk systems and on the interaction of the Regulation with other sectoral legislation.

New governance system

In this context, it is particularly important to ensure proper coordination of the strategy. Therefore, the following is proposed:

  • Apply AI AllianceThe existing AI Alliance becomes the premier coordination forum that brings together AI vendors, industry leaders, academia, and the public sector. Sector-specific groups will allow the implementation of the strategy to be discussed and monitored.
  • AI Observatory: An AI Observatory will be established to provide robust indicators assessing its impact on currently listed and future sectors, monitor developments and trends.

Complementary strategies: science and data as the main axes

The Apply AI Strategy does not act in isolation, but is complemented by two other fundamental strategies: the AI in Science Strategy and the Data Union Strategy.

AI in Science Strategy

Presented together with the Apply AI Strategy, this strategy supports and incentivises the development and use of artificial intelligence by the European scientific community. Its central element is RAISE (Resource for AI Science in Europe), which was presented in November at the AI in Science Summit and will bring together strategic resources: funding, computing capacity, data and talent. RAISE will operate on two pillars: Science for AI (basic research to advance fundamental capabilities) and AI in Science (use of artificial intelligence for progress in different scientific disciplines).

Data Union Strategy

This strategy will focus on ensuring the availability of high-quality, large-scale datasets, essential for training AI models. A key element will be the Data Labs associated with the AI Factories, which will bring together and federate data from different sectors, linking with the  corresponding European Common Data Spaces, making them available to developers under the appropriate conditions.

In short, through significant investments in infrastructure, access to quality data, talent development and a regulatory framework that promotes responsible innovation, the European Union is creating the necessary conditions for companies, public administrations and citizens to take advantage of the full transformative potential of artificial intelligence. The success of these strategies will depend on collaboration between European institutions, national governments, businesses, researchers and developers.

calendar icon
Blog

We live surrounded by AI-generated summaries. We have had the option of generating them for months, but now they are imposed on digital platforms as the first content that our eyes see when using a search engine or opening an email thread. On platforms such as Microsoft Teams or Google Meet, video call meetings are transcribed and summarized in automatic minutes for those who have not been able to be present, but also for those who have been there. However, what a language model has considered important, is it really important for the person receiving the summary?

In this new context, the key is to learn to recover the meaning behind so much summarized information. These three strategies will help you transform automatic content into an understanding and decision-making tool.

1. Ask expansive questions

We tend to summarize to reduce content that we are not able to cover, but we run the risk of associating brief with significant, an equivalence that is not always fulfilled. Therefore, we should not focus from the beginning on summarizing, but on extracting relevant information for us, our context, our vision of the situation and our way of thinking. Beyond the  basic prompt "give me a summary", this new way of approaching content that escapes us consists of cross-referencing data, connecting dots and suggesting hypotheses, which they call  sensemaking. And it happens, first of all, to be clear about what we want to know.

Practical situation:

Imagine a long meeting that we have not been able to attend. That afternoon, we received in our email a summary of the topics discussed. It's not always possible, but a good practice at this point, if our organization allows it, is not to just stay with the summary: if allowed, and always respecting confidentiality guidelines, upload the full transcript to a conversational system such as Copilot or Gemini and ask specific questions:

  • Which topic was repeated the most or received the most attention during the meeting?

  • In a previous meeting, person X used this argument. Was it used again? Did anyone discuss it? Was it considered valid?

  • What premises, assumptions or beliefs are behind this decision that has been made?

  • At the end of the meeting, what elements seem most critical to the success of the project?

  • What signs anticipate possible delays or blockages? Which ones have to do with or could affect my team?

Beware of:

First of all, review and confirm the attributions. Generative models are becoming more and more accurate, but they have a great ability to mix real information with false or generated information. For example, they can attribute a phrase to someone who did not say it, relate ideas as cause and effect that were not really connected, and surely most importantly: assign tasks or responsibilities for next steps to someone who does not correspond.

2. Ask for structured content

Good summaries are not shorter, but more organized, and the written text is not the only format we can use. Look for efficiency and ask conversational systems to return tables, categories, decision lists or relationship maps. Form conditions thought: if you structure information well, you will understand it better and also transmit it better to others, and therefore you will go further with it.

Practical situation:

In this case, let's imagine that we received a long report on the progress of several internal projects of our company. The document has many pages with paragraphs descriptive of status, feedback, dates, unforeseen events, risks and budgets. Reading everything line by line would be impossible and we would not retain the information. The good practice here is to ask for a transformation of the document that is really useful to us. If possible, upload the report to the conversational system and request structured content in a demanding way and without skimping on details:

  • Organize the report in a table with the following columns: project, responsible, delivery date, status, and a final column that indicates if any unforeseen event has occurred or any risk has materialized. If all goes well, print in that column "CORRECT".

  • Generate a visual calendar with deliverables, their due dates, and assignees, starting on October 1, 2025 and ending on January 31, 2026, in the form of a Gantt chart.

  • I want a list that only includes the name of the projects, their start date, and their due date. Sort by delivery date, closest first.

  • From the  customer feedback section  that you will find in each project, create a table with the most repeated comments and which areas or teams they usually refer to. Place them in order, from the most repeated to the least.

  • Give me the billing of the projects that are at risk of not meeting deadlines, indicate the price of each one and the total.

Beware of:

The illusion of veracity and completeness that a clean, orderly, automatic text with fonts will provide us is enormous. A clear format, such as a table, list, or map, can give a false sense of accuracy. If the source data is incomplete or wrong, the structure only makes up the error and we will have a harder time seeing it. AI productions are usually almost perfect. At the very least, and if the document is very long, do random checks ignoring the form and focusing on the content.

3. Connect the dots

Strategic sense is rarely in an isolated text, let alone in a summary. The advanced level in this case consists of asking the multimodal chat to cross-reference sources, compare versions or detect patterns between various materials or formats, such as the transcript of a meeting, an internal report and a scientific article. What is really interesting to see are comparative keys such as evolutionary changes, absences or inconsistencies.

Practical situation:

Let's imagine that we are preparing a proposal for a new project. We have several materials: the transcript of a management team meeting, the previous year's internal report, and a recent article on industry trends. Instead of summarizing them separately, you can upload them to the same conversation thread or chat you've customized on the topic, and ask for more ambitious actions.

  • Compare these three documents and tell me which priorities coincide in all of them, even if they are expressed in different ways.

  • What topics in the internal report were not mentioned at the meeting? Generate a hypothesis for each one as to why they have not been treated.

  • What ideas in the article might reinforce or challenge ours? Give me ideas that are not reflected in our internal report.

  • Look for articles in the press from the last six months that support the strong ideas of the internal report.

  • Find external sources that complement the information missing in these three documents on topic X, and generate a panoramic report with references.

Beware of:

It is very common for AI systems  to deceptively simplify complex discussions, not because they have a hidden purpose but because they have always been rewarded for simplicity and clarity in training. In addition, automatic generation introduces a risk of authority: because the text is presented with the appearance of precision and neutrality, we assume that it is valid and useful. And if that wasn't enough, structured summaries are copied and shared quickly. Before forwarding, make sure that the content is validated, especially if it contains sensitive decisions, names, or data.

AI-based models can help you visualize convergences, gaps, or contradictions and, from there, formulate hypotheses or lines of action. It is about finding with greater agility what is so valuable that we call insights. That is the step from summary to analysis: the most important thing is not to compress the information, but to select it well, relate it and connect it with the context. Intensifying the demand from the prompt is the most appropriate way to work with AI systems, but it also requires a previous personal effort of analysis and landing.

Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.

calendar icon
Blog

 Artificial Intelligence (AI) is transforming society, the economy and public services at an unprecedented speed. This revolution brings enormous opportunities, but also challenges related to ethics, security and the protection of fundamental rights. Aware of this, the European Union approved the Artificial Intelligence Act (AI Act), in force since August 1, 2024, which establishes a harmonized and pioneering framework for the development, commercialization and use of AI systems in the single market, fostering innovation while protecting citizens.

A particularly relevant area of this regulation is general-purpose AI models (GPAI), such as large language models (LLMs) or multimodal models, which are trained on huge volumes of data from a wide variety of sources (text, images and video, audio and even user-generated data). This reality poses critical challenges in intellectual property, data protection and transparency on the origin and processing of information.

To address them, the European Commission, through the European AI Office, has published the Template for the Public Summary of Training Content for general-purpose AI models: a standardized format that providers will be required  to complete and publish to summarize key information about the data used in training. From 2 August 2025, any general-purpose model placed on the market or distributed in the EU must be accompanied by this summary; models already on the market have until 2 August 2027 to adapt. This measure materializes the AI Act's principle of transparency and aims to shed light on the "black boxes" of AI.

In this article, we explain this template keys´s: from its objectives and structure, to information on deadlines, penalties, and next steps.

Objectives and relevance of the template

General-purpose AI models are trained on data from a wide variety of sources and modalities, such as:

  • Text: books, scientific articles, press, social networks.

  • Images and videos: digital content from the Internet and visual collections.

  • Audio: recordings, podcasts, radio programs, or conversations.

  • User data: information generated in interaction with the model itself or with other services of the provider.

This process of mass data collection is often opaque, raising concerns among rights holders, users, regulators, and society as a whole. Without transparency, it is difficult to assess whether data has been obtained lawfully, whether it includes unauthorised personal information or whether it adequately represents the cultural and linguistic diversity of the European Union.

Recital 107 of the AI Act states that the main objective of this template is to increase transparency and facilitate the exercise and protection of rights. Among the benefits it provides, the following stand out:

  1. Intellectual property protection: allows authors, publishers and other rights holders to identify if their works have been used during training, facilitating the defense of their rights and a fair use of their content.

  2. Privacy safeguard:  helps detect whether personal data has been used, providing useful information so that affected individuals can exercise their rights under the General Data Protection Regulation (GDPR) and other regulations in the same field.

  3. Prevention of bias and discrimination: provides information on the linguistic and cultural diversity of the sources used, key to assessing and mitigating biases that may lead to discrimination.

  4. Fostering competition and research: reduces "black box" effects and facilitates academic scrutiny, while helping other companies better understand where data comes from, favoring more open and competitive markets.

In short, this template is not only a legal requirement, but a tool to build trust in artificial intelligence, creating an ecosystem in which technological innovation and the protection of rights are mutually reinforcing.

Template structure

The template, officially published on 24 July 2025 after a public consultation with more than 430 participating organisations, has been designed so that the information is presented in a clear, homogeneous and understandable way, both for specialists and for the public.

It consists of three main sections, ranging from basic model identification to legal aspects related to data processing.

1. General information

It provides a global view of the provider, the model, and the general characteristics of the training data:

  • Identification of the supplier, such as name and contact details.

  • Identification of the model and its versions, including dependencies if it is a modification (fine-tuning) of another model.

  • Date of placing the model on the market in the EU.

  • Data modalities used (text, image, audio, video, or others).

  • Approximate size of data by modality, expressed in wide ranges (e.g., less than 1 billion tokens, between 1 billion and 10 trillion, more than 10 trillion).

  • Language coverage, with special attention to the official languages of the European Union.

This section provides a level of detail sufficient to understand the extent and nature of the training, without revealing trade secrets.

2. List of data sources

It is the core of the template, where the origin of the training data is detailed. It is organized into six main categories, plus a residual category (other).

  1. Public datasets:

    • Data that is freely available and downloadable as a whole or in blocks (e.g., open data portals, common crawl, scholarly repositories).

    • "Large" sets must be identified, defined as those that represent more than 3% of the total public data used in a specific modality.

  2. Licensed private sets:

    • Data obtained through commercial agreements with rights holders or their representatives, such as licenses with publishers for the use of digital books.

    • A general description is provided only.

  3. Other unlicensed private data:

    • Databases acquired from third parties that do not directly manage copyright.

    • If they are publicly known, they must be listed; otherwise, a general description (data type, nature, languages) is sufficient.

  4. Data obtained through web crawling/scraping:

    • Information collected by or on behalf of the supplier using automated tools.

    • It must be specified:

      • Name/identifier of the trackers.

      • Purpose and behavior (respect for robots.txt, captchas, paywalls, etc.).

      • Collection period.

      • Types of websites (media, social networks, blogs, public portals, etc.).

      • List of most relevant domains, covering at least the top 10% by volume. For SMBs, this requirement is adjusted to 5% or a maximum of 1,000 domains, whichever is less.

  5. Users data:

    • Information generated through interaction with the model or with other provider services.

    • It must indicate which services contribute and the modality of the data (text, image, audio, etc.).

  6. Synthetic data:

    • Data created by or for the supplier using other AI models (e.g., model distillation or reinforcement with human feedback - RLHF).

    • Where appropriate, the generator model should be identified if it is available in the market.

Additional category – OtherIncludes data that does not fit into the above categories, such as offline sources, self-digitization, manual tagging, or human generation.

3. Aspects of data processing

It focuses on how data has been handled before and during training, with a particular focus on legal compliance:

  • Respect for Text and Data Mining (TDM): measures taken to honour the right of exclusion provided for in Article 4(3) of Directive 2019/790 on copyright, which allows rightholders to prevent the mining of texts and data. This right is exercised through opt-out protocols, such as tags in files or configurations in robots.txt, that indicate that certain content cannot be used to train models. Vendors should explain how they have identified and respected these opt-outs in their own datasets and in those purchased from third parties.

  • Removal of illegal content: procedures used to prevent or debug content that is illegal under EU law, such as child sexual abuse material, terrorist content or serious intellectual property infringements. These mechanisms may include blacklisting, automatic classifiers, or human review, but without revealing trade secrets.

The following diagram summarizes these three sections:

Template for the Public Summary of Training Content: essential information to be disclosed about the data used to train general-purpose AI models marketed in the European Union.  General information  Identification of the supplier  Identification of the model and its versions  Date of placing the model on the market in the EU  Data modalities used (text, image, audio, video, or others)  Approximate size of data per modality  Language coverage  List of data sources  Public datasets  Licensed private datasets  Other unlicensed private data  Data obtained through web crawling/scraping  User data  Synthetic data  Additional category – Other (e.g., offline sources)  Aspects of data processing  Respect for reserved rights (Text and Data Mining, TDM)  Removal of illegal content  Source:  Template for the Public Summary of Training Content, European Commission (July 2025).

Balancing transparency and trade secrets

The European Commission has designed the template seeking a delicate balance: offering sufficient information to protect rights and promote transparency, without forcing the disclosure of information that could compromise the competitiveness of suppliers.

  • Public sources: the highest level of detail is required, including names and links to "large" datasets.

  • Private sources: a more limited level of detail is allowed, through general descriptions when the information is not public.

  • Web scraping: a summary list of domains is required, without the need to detail exact combinations.

  • User and synthetic data: the information is limited to confirming its use and describing the modality.

Thanks to this approach, the summary is "generally complete" in scope, but not "technically detailed", protecting both transparency  and  the intellectual and commercial property of companies.

Compliance, deadlines and penalties

Article 53 of the AI Act details the obligations of general-purpose model providers, most notably the publication of this summary of training data.

This obligation is complemented by other measures, such as:

  • Have a public copyright policy.

  • Implement risk assessment and mitigation processes, especially for models that may generate systemic risks.

  • Establish mechanisms for traceability and supervision of data and training processes.

Non-compliance can lead  to significant fines, up to €15 million or 3% of the company's annual global turnover, whichever is higher.

Next Steps for Suppliers

To adapt to this new obligation, providers should:

  1. Review internal data collection and management processes to ensure that necessary information is available and verifiable.

  2. Establish clear transparency and copyright policies, including protocols to respect the right of exclusion in text and data mining (TDM).

  3. Publish the abstract on official channels before the corresponding deadline.

  4. Update the summary periodically, at least every six months or when there are material changes in training.

The European Commission, through the European AI Office, will monitor compliance and may request corrections or impose sanctions.

A key tool for governing data

In our previous article, "Governing Data to Govern Artificial Intelligence", we highlighted that reliable AI is only possible if there is a solid governance of data.

This new template reinforces that principle, offering a standardized mechanism for describing the lifecycle of data, from source to processing, and encouraging interoperability and responsible reuse.

This is a decisive step towards a more transparent, fair and  aligned AI with European values, where the protection of rights and technological innovation can advance together.

Conclusions

The publication of the Public Summary Template marks a historic milestone in the regulation of AI in Europe. By requiring providers to document and make public the data used in training, the European Union is taking a decisive step towards a more transparent and trustworthy artificial intelligence, based on responsibility and respect for fundamental rights. In a world where data is the engine of innovation, this tool becomes the key to governing data before governing AI, ensuring that technological development is built on trust and ethics.

Content created by Dr. Fernando Gualo, Professor at UCLM and Government and Data Quality Consultant. The content and views expressed in this publication are the sole responsibility of the author.

calendar icon
Blog

When dealing with the liability arising from the use of autonomous systems based on the use of artificial intelligence , it is common to refer to the ethical dilemmas that a traffic accident can pose. This example is useful to illustrate the problem of liability for damages caused by an accident or even to determine other types of liability in the field of road safety (for example, fines for violations of traffic rules).

Let's imagine that the autonomous vehicle has been driving at a higher speed than the permitted speed or that it has simply skipped a signal and caused an accident involving other vehicles. From the point of view of the legal risks, the liability that would be generated and, specifically, the impact of data in this scenario, we could ask some questions that help us understand the practical scope of this problem:

  • Have all the necessary datasets of sufficient quality to deal with traffic risks in different environments (rural, urban, dense cities, etc.) been considered in the design and training?

  • What is the responsibility if the accident is due to poor integration of the artificial intelligence tool with the vehicle or a failure of the manufacturer that prevents the correct reading of the signs?

  • Who is responsible if the problem stems from incorrect or outdated information on traffic signs?

In this post we are going to explain what aspects must be considered when assessing the liability that can be generated in this type of case.

The impact of data from the perspective of the subjects involved

In the design, training, deployment and use of artificial intelligence systems, the effective control of the data used plays an essential role in the management of legal risks. The conditions of its processing can have important consequences from the perspective of liability in the event of damage or non-compliance with the applicable regulations.

A rigorous approach to this problem requires distinguishing according to each of the subjects involved in the process, from its initial development to its effective use in specific circumstances, since the conditions and consequences can be very different. In this sense, it is necessary to identify the origin of the damage or non-compliance in order to impute the legal consequences to the person who should effectively be considered responsible:

  • Thus, damage or non-compliance may be determined by a design problem in the application used or in its training, so that certain data is misused for this purpose. Continuing with the example of an autonomous vehicle, this would be the case of accessing the data of the people traveling in it without consent.

  • However, it is also possible that the problem originates from the person who deploys the tool in each environment for real use, a position that would be occupied by the vehicle manufacturer. This could happen if, for its operation, data is accessed without the appropriate permissions or if there are restrictions that prevent access to the information necessary to guarantee its proper functioning.

  • The problem could also be generated by the person or entity using the tool itself. Returning to the example of the vehicle, it could be stated that the ownership of the vehicle corresponds to a company or an individual that has not carried out the necessary periodic inspections or updated the system when necessary.

  • Finally, there is the possibility that the legal problem of liability is determined by the conditions under which the data are provided at their original source. For example, if the data is inaccurate: the information about the road on which the vehicle is traveling is not up to date or the data emitted by traffic signs is not sufficiently accurate.

Challenges related to the technological environment: complexity and opacity

In addition, the very uniqueness of the technology used may significantly condition the attribution of liability. Specifically, technological opacity – that is, the difficulty in understanding why a system makes a specific decision – is one of the main challenges when it comes to addressing the legal challenges posed by artificial intelligence, as it makes it difficult to determine the responsible subject. This is a problem that acquires special importance with regard to the lawful origin of the data and, likewise, the conditions under which its processing takes place. In fact, this was precisely the main stumbling block that generative artificial intelligence encountered in the initial moments of its landing in Europe: the lack of adequate conditions of transparency regarding the processing of personal data justified the temporary halt of its commercialization until the necessary adjustments were made.

In this sense, the publication of the data used for the training phase becomes an additional guarantee from the perspective of legal certainty and, specifically, to verify the regulatory compliance conditions of the tool.

On the other hand, the complexity inherent in this technology poses an additional difficulty in terms of the imputation of the damage that may be caused and, consequently, in the determination of who should pay for it. Continuing with the example of the autonomous vehicle, it could be the case that various causes overlap, such as the inaccuracy of the data provided by traffic signs and, at the same time, a malfunction of the computer application by not detecting potential inconsistencies between the data used and its actual needs.

What does the regulation of the European Regulation on artificial intelligence say about it?

Regulation (EU) 2024/1689 establishes a harmonised regulatory framework across the European Union in relation to artificial intelligence. With regard to data, it includes some specific obligations for systems classified as "high risk", which are those contemplated in Article 6 and in the list in Annex III (biometric identification, education, labour management, access to essential services, etc.). In this sense, it incorporates a strict regime of technical requirements, transparency, supervision and auditing, combined with conformity assessment procedures prior to its commercialization and post-market control mechanisms, also establishing precise responsibilities for suppliers, operators and other actors in the value chain.

As regards data governance, a risk management system  should be put in place covering the entire lifecycle of the tool and assessing, mitigating, monitoring and documenting risks to health, safety and fundamental rights. Specifically, training, validation, and testing datasets are required to be:

  • Relevant, representative, complete and as error-free as possible for the intended purpose.

  • Managed in accordance with strict governance practices that mitigate bias and discrimination, especially when they may affect the fundamental rights of vulnerable or minority groups.

  • The Regulation also lays down strict conditions for the exceptional use of special categories of personal data with regard to the detection and, where appropriate, correction of bias.

With regard to technical documentation and record keeping, the following are required:

  • The preparation and maintenance of exhaustive technical documentation. In particular, with regard to transparency, complete and clear instructions for use should be provided,      including information on data and output results, among other things.

  • Systems should allow for the automatic recording of relevant events (logs) throughout their life cycle to ensure traceability and facilitate post-market surveillance, which can be very useful when checking the incidence of the data used.

As regards liability, that regulation is based on an approach that is admittedly limited from two points of view:

  • Firstly, it merely empowers Member States to establish a sanctioning regime that provides for the imposition of fines and other means of enforcement, such as warnings and non-pecuniary measures, which must be effective, proportionate and dissuasive of non-compliance with the regulation. They are, therefore, instruments of an administrative nature and punitive in nature, that is, punishment for non-compliance with the obligations established in said regulation, among which are those relating to data governance and the documentation and conservation of records referred to above.

  • However, secondly, the European regulator has not considered it appropriate to establish specific provisions regarding civil liability with the aim of compensating for the damage caused. This is an issue of great relevance that even led the European Commission to formulate a proposal for a specific Directive in 2022. Although its processing has not been completed, it has given rise to an interesting debate whose main arguments have been systematised in a comprehensive report by the European Parliament analysing the impact that this regulation could have.

No clear answers: open debate and regulatory developments

Thus, despite the progress made by the approval of the 2024 Regulation, the truth is that the regulation of liability arising from the use of artificial intelligence tools remains an open question on which there is no complete and developed regulatory framework. However, once the approach regarding the legal personification of robots that arose a few years ago has been overcome, it is unquestionable that artificial intelligence in itself cannot be considered a legally responsible subject.

As emphasized above, this is a complex debate in which it is not possible to offer simple and general answers, since it is essential to specify them in each specific case, taking into account the subjects that have intervened in each of the phases of design, implementation and use of the corresponding tool. It will therefore be these subjects who will have to assume the corresponding responsibility, either for the compensation of the damage caused or, where appropriate, to face the sanctions and other administrative measures in the event of non-compliance with the regulation.

In short, although the European regulation on artificial intelligence of 2024 may be useful to establish standards that help determine when a damage caused is contrary to law and, therefore, must be compensated, the truth is that it is an unclosed debate that will have to be redirected applying the general rules on consumer protection or defective products, taking into account the singularities of this technology. And, as far as administrative responsibility is concerned, it will be necessary to wait for the initiative that was announced a few months ago  and that is pending formal approval by the Council of Ministers for its subsequent parliamentary processing in the Spanish Parliament.

Content prepared by Julián Valero, Professor at the University of Murcia and Coordinator of the Research Group "Innovation, Law and Technology" (iDerTec). The contents and points of view reflected in this publication are the sole responsibility of its author.

calendar icon
Blog

Open data from public sources has evolved over the years, from being simple repositories of information to constituting dynamic ecosystems that can transform public governance. In this context, artificial intelligence (AI) emerges as a catalytic technology that benefits from the value of open data and exponentially enhances its usefulness. In this post we will see what the mutually beneficial symbiotic relationship between AI and open data looks like.

Traditionally, the debate on open data has focused on portals: the platforms on which governments publish information so that citizens, companies and organizations can access it. But the so-called "Third Wave of Open Data," a term by New York University's GovLab, emphasizes that it is no longer enough to publish datasets on demand or by default. The important thing is to think about the entire ecosystem: the life cycle of data, its exploitation, maintenance and, above all, the value it generates in society.

What role can open data play in AI?

In this context, AI appears as a catalyst capable of automating tasks, enriching open government data (DMOs), facilitating its understanding and stimulating collaboration between actors.

Recent research, developed by European universities, maps how this silent revolution is happening. The study proposes a classification of uses according to two dimensions:

  1. Perspective, which in turn is divided into two possible paths:
    1. Inward-looking (portal): The focus is on the internal functions of data portals.
    2. Outward-looking (ecosystem): the focus is extended to interactions with external actors (citizens, companies, organizations).
  2. Phases of the data life cycle, which can be divided into pre-processing, exploration, transformation and maintenance.

In summary, the report identifies these eight types of AI use in government open data, based on perspective and phase in the data lifecycle.

Table showing how AI supports different roles across the Open Government Data (OGD) lifecycle from two perspectives: "Inward Looking – Portal Perspective" and "Outward Looking – Ecosystem Perspective." The lifecycle stages are Pre-processing, Exploration, Transformation, and Maintenance.  Portal Perspective:  Pre-processing: AI acts as a Portal Curator, automating tagging, labeling, categorization, quality checks, and anonymization of sensitive data using privacy risk models.  Exploration: AI serves as a Portal Explorer, enhancing user access to data via natural language interfaces, improved search, and chatbots.  Transformation: AI functions as a Portal Linker, converting and aggregating data, such as transforming it into Linked OGD using machine learning.  Maintenance: AI operates as a Portal Monitor, ensuring compliance with standards, detecting anomalies, and improving metadata quality.  Ecosystem Perspective:  Pre-processing: AI is an Ecosystem Data Retriever, sourcing external data (e.g., legal texts) to enrich the OGD ecosystem.  Exploration: AI acts as an Ecosystem Connector, linking stakeholders by recommending relevant datasets and assets.  Transformation: AI becomes an Ecosystem Value Developer, enabling services and products using OGD, such as AI-powered dashboards and analytics suggestions.  Maintenance: AI serves as an Ecosystem Engager, encouraging ongoing stakeholder participation by predicting portal usage and enhancing engagement.

Figure 1. Eight uses of AI to improve government open data. Source: presentation “Data for AI or AI for data: artificial intelligence as a catalyst for open government ecosystems”, based on the report of the same name, from EU Open Data Days 2025.

Each of these uses is detailed below:

1. Portal curator

This application focuses on pre-processing data within the portal. AI helps organize, clean, anonymize, and tag datasets before publication. Some examples of tasks are:

  • Automation and improvement of data publication tasks.
  • Performing auto-tagging and categorization functions.
  • Data anonymization to protect privacy.
  • Automatic cleaning and filtering of datasets.
  • Feature extraction and missing data handling.

2. Ecosystem data retriever

Also in the pre-processing phase, but with an external focus, AI expands the coverage of portals by identifying and collecting information from diverse sources. Some tasks are:

  • Retrieve structured data from legal or regulatory texts.
  • News mining to enrich datasets with contextual information.
  • Integration of urban data from sensors or digital records.
  • Discovery and linking of heterogeneous sources.
  • Conversion of complex documents into structured information.

3. Portal explorer

In the exploration phase, AI systems can also make it easier to find and interact with published data, with a more internal approach. Some use cases:

  • Develop semantic search engines to locate datasets.
  • Implement chatbots that guide users in data exploration.
  • Provide natural language interfaces for direct queries.
  • Optimize the portal's internal search engines.
  • Use language models to improve information retrieval.

4. Ecosystem connector

Operating also in the exploration phase, AI acts as a bridge between actors and ecosystem resources. Some examples are:

  • Recommend relevant datasets to researchers or companies.
  • Identify potential partners based on common interests.
  • Extract emerging themes to support policymaking.
  • Visualize data from multiple sources in interactive dashboards.
  • Personalize data suggestions based on social media activity.

5. Portal linker

This functionality focuses on the transformation of data within the portal. Its function is to facilitate the combination and presentation of information for different audiences. Some tasks are:

  • Convert data into knowledge graphs (structures that connect related information, known as Linked Open Data).
  • Summarize and simplify data with NLP (Natural Language Processing) techniques.
  • Apply automatic reasoning to generate derived information.
  • Enhance multivariate visualization of complex datasets.
  • Integrate diverse data into accessible information products.

6. Ecosystem value developer

In the transformation phase and with an external perspective, AI generates products and services based on open data that provide added value. Some tasks are:

  • Suggest appropriate analytical techniques based on the type of dataset.
  • Assist in the coding and processing of information.
  • Create dashboards based on predictive analytics.
  • Ensure the correctness and consistency of the transformed data.
  • Support the development of innovative digital services.

7. Portal monitor

It focuses on portal maintenance, with an internal focus. Their role is to ensure quality, consistency, and compliance with standards. Some tasks are:

  • Detect anomalies and outliers in published datasets.
  • Evaluate the consistency of metadata and schemas.
  • Automate data updating and purification processes.
  • Identify incidents in real time for correction.
  • Reduce maintenance costs through intelligent monitoring.

8. Ecosystem engager

And finally, this function operates in the maintenance phase, but outwardly. It seeks to promote citizen participation and continuous interaction. Some tasks are:

  • Predict usage patterns and anticipate user needs.
  • Provide personalized feedback on datasets.
  • Facilitate citizen auditing of data quality.
  • Encourage participation in open data communities.
  • Identify user profiles to design more inclusive experiences.

What does the evidence tell us?

The study is based on a review of more than 70 academic papers examining the intersection between AI and OGD (open government data). From these cases, the authors observe that:

  • Some of the defined profiles, such as portal curator, portal explorer and portal monitor, are relatively mature and have multiple examples in the literature.
  • Others, such as ecosystem value developer and ecosystem engager, are less explored, although they have the most potential to generate social and economic impact.
  • Most applications today focus on automating specific tasks, but there is a lot of scope to design more comprehensive architectures, combining several types of AI in the same portal or across the entire data lifecycle.

From an academic point of view, this typology provides a common language and conceptual structure to study the relationship between AI and open data. It allows identifying gaps in research and guiding future work towards a more systemic approach.

In practice, the framework is useful for:

  • Data portal managers: helps them identify what types of AI they can implement according to their needs, from improving the quality of datasets to facilitating interaction with users.
  • Policymakers: guides them on how to design AI adoption strategies in open data initiatives, balancing efficiency, transparency, and participation.
  • Researchers and developers: it offers them a map of opportunities to create innovative tools that address specific ecosystem needs.

Limitations and next steps of the synergy between AI and open data

In addition to the advantages, the study recognizes some pending issues that, in a way, serve as a roadmap for the future. To begin with, several of the applications that have been identified are still in early stages or are conceptual. And, perhaps most relevantly, the debate on the risks and ethical dilemmas of the use of AI in open data has not yet been addressed in depth: bias, privacy, technological sustainability.

In short, the combination of AI and open data is still a field under construction, but with enormous potential. The key will be to move from isolated experiments to comprehensive strategies, capable of generating social, economic and democratic value. AI, in this sense, does not work independently of open data: it multiplies it and makes it more relevant for governments, citizens and society in general.

calendar icon
Evento

For the first time in the history of the organization, Spain will host the Global Summit of the Open Government Partnership (OGP), an international institution of reference in open government and citizen participation. From 6 to 10 October 2025, Vitoria-Gasteiz will become the world capital of open government, welcoming more than 2,000 representatives of governments, civil society organisations and public policy experts from all over the world.

Although registration for the Summit is now closed due to high demand, citizens will be able to follow some of the plenary sessions through online broadcasts and participate in the debates through social networks. In addition, the results and commitments arising from the Summit will be available on the OGP and Government of Spain digital platforms.

In this post, we review the objective, program of activities and more information of interest.

Program of activities of a global event

The OGP Global Summit 2025 will take place at the Europa Conference Centre in Vitoria-Gasteiz, where an ambitious agenda will be developed aligned with the Co-Presidency Programme of the Government of Spain and the Philippine organisation Bankay Kita, Cielo Magno. This agenda is structured around three fundamental thematic axes:

  • People: Activities that address the protection of civic space, the strengthening of democracy, and balancing the contributions of government, civil society, and the private sector. This axis seeks to ensure that all social actors have a voice in democratic processes.

  • Institutions: This block will address the participation of all branches of government to improve transparency, accountability, and citizen participation at all levels of government.

  • Technology and data: It will explore digital rights, social media governance, and internet freedom, as well as promoting digital civic space and freedom of expression in the digital age.

The OGP Summit's programming includes high-level plenary sessions, specialized workshops, side events, and networking spaces  that will facilitate knowledge sharing and alliance building. You can check the full program here, among the highlights are:

  • Artificial intelligence and open government: the participatory governance of AI and how to ensure that technological development respects democratic principles and human rights will be discussed.

  • Algorithmic transparency: the need to make algorithmic systems used in public decision-making visible and understandable will be discussed.

  • Open Justice: It will explore how to strengthen the rule of law through more transparent and accessible judicial systems for citizens.

  • Inclusive participation: experiences will be shared on how to ensure that populations in vulnerable situations can effectively participate in democratic processes.

  • Open public procurement: best practices will be presented to make public spending more transparent and efficient through open procurement processes.

Among the most relevant sessions for the open data ecosystem, the one organized by Red.es "When AI meets open data" stands out, which will be held on the 8th at 9 a.m. Through a round table,  it will be shown how artificial intelligence and open data enhance each other. On the one hand, AI helps to get more out of open data, and on the other hand, this data is essential for training and improving AI systems.

In addition, at the same time, on Thursday 9, the presentation "From data to impact through public-private partnerships and sharing ecosystems" will be held, organized by the General Directorate of Data of the Ministry for Digital Transformation and Public Function. This session will address how public-private sector collaborations can maximize the value of data to make a real impact on society, exploring innovative models of data sharing that respect privacy and foster innovation.

A legacy of democratic transformation

The Vitoria-Gasteiz Summit adds to the tradition of the eight previous summits held in Canada, Georgia, Estonia, France, Korea, Mexico, the United Kingdom and Brazil. Each of these summits has contributed to strengthening the global open government movement, generating concrete commitments that have transformed the relationship between governments and citizens.

In this edition, the most promising and impactful reforms will be recognized through the Open Gov Awards, celebrating innovation and progress in open government globally. These awards highlight initiatives that have demonstrated a real impact on the lives of citizens and that can serve as an inspiration for other countries and territories.

Multi-stakeholder engagement and collaboration

A distinctive feature of OGP is its multi-stakeholder approach, which ensures that both governments and civil society organizations have a say in defining open government agendas. This Summit will be no exception, and will be attended by representatives of citizen organizations, academics, businessmen and activists working for a more participatory and transparent democracy.

At the same time, other events will be held that will complement the official agenda. These activities will address specific topics such as the protection of whistleblowers, youth participation or the integration of the gender perspective in public policies.

This year, the OGP Global Summit 2025 in Vitoria-Gasteiz aims to generate concrete commitments that strengthen democracy in the digital age. As determined by the Open Government Partnership, the participating countries would make new commitments in their national action plans, especially in areas such as the governance of artificial intelligence, the protection of digital civic space and the fight against disinformation.

In summary, the OGP 2025 Global Summit in Vitoria-Gasteiz marks a pivotal moment for the future of democracy. In a context of growing challenges for democratic institutions, this meeting reaffirms the importance of maintaining open, transparent and participatory governments as fundamental pillars of free and prosperous societies.

calendar icon