Publication date 13/10/2025
Update date 14/10/2025

Artificial Intelligence (AI) is transforming society, the economy and public services at an unprecedented speed. This revolution brings enormous opportunities, but also challenges related to ethics, security and the protection of fundamental rights. Aware of this, the European Union approved the Artificial Intelligence Act (AI Act), in force since August 1, 2024, which establishes a harmonized and pioneering framework for the development, commercialization and use of AI systems in the single market, fostering innovation while protecting citizens.

A particularly relevant area of this regulation is general-purpose AI models (GPAI), such as large language models (LLMs) or multimodal models, which are trained on huge volumes of data from a wide variety of sources (text, images and video, audio and even user-generated data). This reality poses critical challenges in intellectual property, data protection and transparency on the origin and processing of information.

To address them, the European Commission, through the European AI Office, has published the Template for the Public Summary of Training Content for general-purpose AI models: a standardized format that providers will be required to complete and publish to summarize key information about the data used in training. From 2 August 2025, any general-purpose model placed on the market or distributed in the EU must be accompanied by this summary; models already on the market have until 2 August 2027 to adapt. This measure materializes the AI Act's principle of transparency and aims to shed light on the "black boxes" of AI.

In this article, we explain the keys to this template: from its objectives and structure to deadlines, penalties, and next steps.

Objectives and relevance of the template

General-purpose AI models are trained on data from a wide variety of sources and modalities, such as:

  • Text: books, scientific articles, press, social networks.

  • Images and videos: digital content from the Internet and visual collections.

  • Audio: recordings, podcasts, radio programs, or conversations.

  • User data: information generated in interaction with the model itself or with other services of the provider.

This process of mass data collection is often opaque, raising concerns among rights holders, users, regulators, and society as a whole. Without transparency, it is difficult to assess whether data has been obtained lawfully, whether it includes unauthorised personal information or whether it adequately represents the cultural and linguistic diversity of the European Union.

Recital 107 of the AI Act states that the main objective of this template is to increase transparency and facilitate the exercise and protection of rights. Among the benefits it provides, the following stand out:

  1. Intellectual property protection: allows authors, publishers and other rights holders to identify if their works have been used during training, facilitating the defense of their rights and a fair use of their content.

  2. Privacy safeguard:  helps detect whether personal data has been used, providing useful information so that affected individuals can exercise their rights under the General Data Protection Regulation (GDPR) and other regulations in the same field.

  3. Prevention of bias and discrimination: provides information on the linguistic and cultural diversity of the sources used, key to assessing and mitigating biases that may lead to discrimination.

  4. Fostering competition and research: reduces "black box" effects and facilitates academic scrutiny, while helping other companies better understand where data comes from, favoring more open and competitive markets.

In short, this template is not only a legal requirement, but a tool to build trust in artificial intelligence, creating an ecosystem in which technological innovation and the protection of rights are mutually reinforcing.

Template structure

The template, officially published on 24 July 2025 after a public consultation with more than 430 participating organisations, has been designed so that the information is presented in a clear, homogeneous and understandable way, both for specialists and for the public.

It consists of three main sections, ranging from basic model identification to legal aspects related to data processing.

1. General information

It provides a global view of the provider, the model, and the general characteristics of the training data:

  • Identification of the provider, such as name and contact details.

  • Identification of the model and its versions, including dependencies if it is a modification (fine-tuning) of another model.

  • Date of placing the model on the market in the EU.

  • Data modalities used (text, image, audio, video, or others).

  • Approximate size of data by modality, expressed in wide ranges (e.g., fewer than 1 billion tokens, between 1 billion and 10 trillion tokens, or more than 10 trillion tokens).

  • Language coverage, with special attention to the official languages of the European Union.

This section provides a level of detail sufficient to understand the extent and nature of the training, without revealing trade secrets.
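The coarse size ranges above can be sketched as a simple lookup. This is a minimal, hypothetical illustration: the band boundaries follow the example given in this article, while the official template defines the exact bands per modality.

```python
# Hypothetical sketch: map a raw token count to the coarse size
# ranges used in the template. The band boundaries below follow the
# article's example; the official template defines the exact bands.
def size_range(token_count: int) -> str:
    """Return the coarse disclosure band for a modality's token count."""
    if token_count < 1_000_000_000:           # fewer than 1 billion
        return "less than 1 billion tokens"
    if token_count <= 10_000_000_000_000:     # 1 billion to 10 trillion
        return "between 1 billion and 10 trillion tokens"
    return "more than 10 trillion tokens"

print(size_range(750_000_000))         # smallest band
print(size_range(2_500_000_000_000))   # middle band
```

The point of such wide bands is precisely the balance described below: the disclosure conveys the scale of training without revealing exact dataset sizes.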

2. List of data sources

It is the core of the template, where the origin of the training data is detailed. It is organized into six main categories, plus a residual category (other).

  1. Public datasets:

    • Data that is freely available and downloadable as a whole or in blocks (e.g., open data portals, Common Crawl, scholarly repositories).

    • "Large" sets must be identified, defined as those that represent more than 3% of the total public data used in a specific modality.

  2. Licensed private sets:

    • Data obtained through commercial agreements with rights holders or their representatives, such as licenses with publishers for the use of digital books.

    • A general description is provided only.

  3. Other unlicensed private data:

    • Databases acquired from third parties that do not directly manage copyright.

    • If they are publicly known, they must be listed; otherwise, a general description (data type, nature, languages) is sufficient.

  4. Data obtained through web crawling/scraping:

    • Information collected by or on behalf of the supplier using automated tools.

    • It must be specified:

      • Name/identifier of the trackers.

      • Purpose and behavior (respect for robots.txt, captchas, paywalls, etc.).

      • Collection period.

      • Types of websites (media, social networks, blogs, public portals, etc.).

      • List of most relevant domains, covering at least the top 10% by volume. For SMEs, this requirement is adjusted to 5% or a maximum of 1,000 domains, whichever is less.

  5. User data:

    • Information generated through interaction with the model or with other provider services.

    • It must indicate which services contribute and the modality of the data (text, image, audio, etc.).

  6. Synthetic data:

    • Data created by or for the provider using other AI models (e.g., model distillation or reinforcement learning from human feedback, RLHF).

    • Where appropriate, the generator model should be identified if it is available on the market.

Additional category – Other: includes data that does not fit into the above categories, such as offline sources, self-digitization, manual tagging, or human generation.
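The domain-disclosure rule for scraped data (category 4 above) lends itself to a short sketch. This is an illustrative interpretation, not the Commission's method: domain names and volumes are invented, and we read "top 10%" as the top 10% of domains ranked by scraped volume, with the SME adjustment of 5% or at most 1,000 domains, whichever is less.

```python
import math

# Hypothetical sketch of the domain-disclosure rule for scraped data.
# "volumes" maps each scraped domain to the amount of content collected
# from it (any consistent unit); the names below are illustrative.
def domains_to_disclose(volumes, sme=False):
    """Return the domains to list, ranked by scraped volume."""
    ranked = sorted(volumes, key=volumes.get, reverse=True)
    share = 0.05 if sme else 0.10          # SMEs: 5%; others: at least 10%
    count = math.ceil(len(ranked) * share)
    if sme:
        count = min(count, 1000)           # SME cap: 5% or 1,000, whichever is less
    return ranked[:count]

volumes = {f"site{i}.example": 1000 - i for i in range(50)}
print(domains_to_disclose(volumes))            # top 10% of 50 domains
print(domains_to_disclose(volumes, sme=True))  # top 5%, capped at 1,000
```

In practice a provider would derive `volumes` from its crawl logs; the key design point is that only a ranked slice of domains is published, not the full crawl inventory.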

3. Aspects of data processing

It focuses on how data has been handled before and during training, with a particular focus on legal compliance:

  • Respect for Text and Data Mining (TDM): measures taken to honour the right of exclusion provided for in Article 4(3) of Directive 2019/790 on copyright, which allows rights holders to prevent the mining of texts and data. This right is exercised through opt-out protocols, such as tags in files or configurations in robots.txt, that indicate that certain content cannot be used to train models. Providers should explain how they have identified and respected these opt-outs in their own datasets and in those purchased from third parties.

  • Removal of illegal content: procedures used to prevent or debug content that is illegal under EU law, such as child sexual abuse material, terrorist content or serious intellectual property infringements. These mechanisms may include blacklisting, automatic classifiers, or human review, but without revealing trade secrets.
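One concrete opt-out channel mentioned above is robots.txt. As a minimal sketch, Python's standard-library `urllib.robotparser` can check whether a given crawler is allowed to fetch a URL; the crawler name and rules below are invented, and real TDM opt-outs can also be expressed through other protocols (e.g., metadata tags) that this sketch does not cover.

```python
# Minimal sketch: checking a robots.txt opt-out with the standard library.
# "ExampleAIBot" is a hypothetical crawler name used for illustration.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse an illustrative robots.txt that excludes an AI crawler entirely
# while only keeping other crawlers out of /private/.
rp.parse([
    "User-agent: ExampleAIBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("ExampleAIBot", "https://example.org/articles/1"))  # False
print(rp.can_fetch("OtherBot", "https://example.org/articles/1"))      # True
```

A provider's disclosure would describe how checks like this were applied across its crawlers and acquired datasets, not just for a single site.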

The following diagram summarizes these three sections:

[Diagram] Template for the Public Summary of Training Content: essential information to be disclosed about the data used to train general-purpose AI models marketed in the European Union. It covers (1) general information: identification of the provider, identification of the model and its versions, date of placing on the market in the EU, data modalities used, approximate size of data per modality, and language coverage; (2) the list of data sources: public datasets, licensed private datasets, other unlicensed private data, data obtained through web crawling/scraping, user data, synthetic data, and the additional category "other" (e.g., offline sources); and (3) aspects of data processing: respect for reserved rights (Text and Data Mining, TDM) and removal of illegal content. Source: Template for the Public Summary of Training Content, European Commission (July 2025).

Balancing transparency and trade secrets

The European Commission has designed the template seeking a delicate balance: offering sufficient information to protect rights and promote transparency, without forcing the disclosure of information that could compromise the competitiveness of suppliers.

  • Public sources: the highest level of detail is required, including names and links to "large" datasets.

  • Private sources: a more limited level of detail is allowed, through general descriptions when the information is not public.

  • Web scraping: a summary list of domains is required, without the need to detail exact combinations.

  • User and synthetic data: the information is limited to confirming its use and describing the modality.

Thanks to this approach, the summary is "generally complete" in scope, but not "technically detailed", protecting both transparency and the intellectual and commercial property of companies.

Compliance, deadlines and penalties

Article 53 of the AI Act details the obligations of general-purpose model providers, most notably the publication of this summary of training data.

This obligation is complemented by other measures, such as:

  • Have a public copyright policy.

  • Implement risk assessment and mitigation processes, especially for models that may generate systemic risks.

  • Establish mechanisms for traceability and supervision of data and training processes.

Non-compliance can lead to significant fines, up to €15 million or 3% of the company's annual global turnover, whichever is higher.

Next steps for providers

To adapt to this new obligation, providers should:

  1. Review internal data collection and management processes to ensure that necessary information is available and verifiable.

  2. Establish clear transparency and copyright policies, including protocols to respect the right of exclusion in text and data mining (TDM).

  3. Publish the summary on official channels before the corresponding deadline.

  4. Update the summary periodically, at least every six months or when there are material changes in training.

The European Commission, through the European AI Office, will monitor compliance and may request corrections or impose sanctions.

A key tool for governing data

In our previous article, "Governing Data to Govern Artificial Intelligence", we highlighted that reliable AI is only possible if there is a solid governance of data.

This new template reinforces that principle, offering a standardized mechanism for describing the lifecycle of data, from source to processing, and encouraging interoperability and responsible reuse.

This is a decisive step towards a more transparent and fair AI, aligned with European values, where the protection of rights and technological innovation can advance together.

Conclusions

The publication of the Public Summary Template marks a historic milestone in the regulation of AI in Europe. By requiring providers to document and make public the data used in training, the European Union is taking a decisive step towards a more transparent and trustworthy artificial intelligence, based on responsibility and respect for fundamental rights. In a world where data is the engine of innovation, this tool becomes the key to governing data before governing AI, ensuring that technological development is built on trust and ethics.

Content created by Dr. Fernando Gualo, Professor at UCLM and Government and Data Quality Consultant. The content and views expressed in this publication are the sole responsibility of the author.