buenas prácticas

Three strategies to get the most out of your AI-powered summaries

Blog

We live surrounded by AI-generated summaries. We have had the option of generating them for months, but now they are imposed on digital platforms as the first content that our eyes see when using a search engine or opening an email thread. On platforms such as Microsoft Teams or Google Meet, video call meetings are transcribed and summarized in automatic minutes for those who have not been able to be present, but also for those who have been there. However, what a language model has considered important, is it really important for the person receiving the summary?

In this new context, the key is to learn to recover the meaning behind so much summarized information. These three strategies will help you transform automatic content into an understanding and decision-making tool.

1. Ask expansive questions

We tend to summarize to reduce content that we are not able to cover, but we run the risk of associating brief with significant, an equivalence that is not always fulfilled. Therefore, we should not focus from the beginning on summarizing, but on extracting relevant information for us, our context, our vision of the situation and our way of thinking. Beyond the basic prompt "give me a summary", this new way of approaching content that escapes us consists of cross-referencing data, connecting dots and suggesting hypotheses, which they call sensemaking. And it happens, first of all, to be clear about what we want to know.

Practical situation:

Imagine a long meeting that we have not been able to attend. That afternoon, we received in our email a summary of the topics discussed. It's not always possible, but a good practice at this point, if our organization allows it, is not to just stay with the summary: if allowed, and always respecting confidentiality guidelines, upload the full transcript to a conversational system such as Copilot or Gemini and ask specific questions:

Which topic was repeated the most or received the most attention during the meeting?
In a previous meeting, person X used this argument. Was it used again? Did anyone discuss it? Was it considered valid?
What premises, assumptions or beliefs are behind this decision that has been made?
At the end of the meeting, what elements seem most critical to the success of the project?
What signs anticipate possible delays or blockages? Which ones have to do with or could affect my team?

Beware of:

First of all, review and confirm the attributions. Generative models are becoming more and more accurate, but they have a great ability to mix real information with false or generated information. For example, they can attribute a phrase to someone who did not say it, relate ideas as cause and effect that were not really connected, and surely most importantly: assign tasks or responsibilities for next steps to someone who does not correspond.

2. Ask for structured content

Good summaries are not shorter, but more organized, and the written text is not the only format we can use. Look for efficiency and ask conversational systems to return tables, categories, decision lists or relationship maps. Form conditions thought: if you structure information well, you will understand it better and also transmit it better to others, and therefore you will go further with it.

Practical situation:

In this case, let's imagine that we received a long report on the progress of several internal projects of our company. The document has many pages with paragraphs descriptive of status, feedback, dates, unforeseen events, risks and budgets. Reading everything line by line would be impossible and we would not retain the information. The good practice here is to ask for a transformation of the document that is really useful to us. If possible, upload the report to the conversational system and request structured content in a demanding way and without skimping on details:

Organize the report in a table with the following columns: project, responsible, delivery date, status, and a final column that indicates if any unforeseen event has occurred or any risk has materialized. If all goes well, print in that column "CORRECT".
Generate a visual calendar with deliverables, their due dates, and assignees, starting on October 1, 2025 and ending on January 31, 2026, in the form of a Gantt chart.
I want a list that only includes the name of the projects, their start date, and their due date. Sort by delivery date, closest first.
From the customer feedback section that you will find in each project, create a table with the most repeated comments and which areas or teams they usually refer to. Place them in order, from the most repeated to the least.
Give me the billing of the projects that are at risk of not meeting deadlines, indicate the price of each one and the total.

Beware of:

The illusion of veracity and completeness that a clean, orderly, automatic text with fonts will provide us is enormous. A clear format, such as a table, list, or map, can give a false sense of accuracy. If the source data is incomplete or wrong, the structure only makes up the error and we will have a harder time seeing it. AI productions are usually almost perfect. At the very least, and if the document is very long, do random checks ignoring the form and focusing on the content.

3. Connect the dots

Strategic sense is rarely in an isolated text, let alone in a summary. The advanced level in this case consists of asking the multimodal chat to cross-reference sources, compare versions or detect patterns between various materials or formats, such as the transcript of a meeting, an internal report and a scientific article. What is really interesting to see are comparative keys such as evolutionary changes, absences or inconsistencies.

Practical situation:

Let's imagine that we are preparing a proposal for a new project. We have several materials: the transcript of a management team meeting, the previous year's internal report, and a recent article on industry trends. Instead of summarizing them separately, you can upload them to the same conversation thread or chat you've customized on the topic, and ask for more ambitious actions.

Compare these three documents and tell me which priorities coincide in all of them, even if they are expressed in different ways.
What topics in the internal report were not mentioned at the meeting? Generate a hypothesis for each one as to why they have not been treated.
What ideas in the article might reinforce or challenge ours? Give me ideas that are not reflected in our internal report.
Look for articles in the press from the last six months that support the strong ideas of the internal report.
Find external sources that complement the information missing in these three documents on topic X, and generate a panoramic report with references.

Beware of:

It is very common for AI systems to deceptively simplify complex discussions, not because they have a hidden purpose but because they have always been rewarded for simplicity and clarity in training. In addition, automatic generation introduces a risk of authority: because the text is presented with the appearance of precision and neutrality, we assume that it is valid and useful. And if that wasn't enough, structured summaries are copied and shared quickly. Before forwarding, make sure that the content is validated, especially if it contains sensitive decisions, names, or data.

AI-based models can help you visualize convergences, gaps, or contradictions and, from there, formulate hypotheses or lines of action. It is about finding with greater agility what is so valuable that we call insights. That is the step from summary to analysis: the most important thing is not to compress the information, but to select it well, relate it and connect it with the context. Intensifying the demand from the prompt is the most appropriate way to work with AI systems, but it also requires a previous personal effort of analysis and landing.

Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.

03/11/2025

More transparency in AI: new template for documenting general-purpose model training data

Blog

Artificial Intelligence (AI) is transforming society, the economy and public services at an unprecedented speed. This revolution brings enormous opportunities, but also challenges related to ethics, security and the protection of fundamental rights. Aware of this, the European Union approved the Artificial Intelligence Act (AI Act), in force since August 1, 2024, which establishes a harmonized and pioneering framework for the development, commercialization and use of AI systems in the single market, fostering innovation while protecting citizens.

A particularly relevant area of this regulation is general-purpose AI models (GPAI), such as large language models (LLMs) or multimodal models, which are trained on huge volumes of data from a wide variety of sources (text, images and video, audio and even user-generated data). This reality poses critical challenges in intellectual property, data protection and transparency on the origin and processing of information.

To address them, the European Commission, through the European AI Office, has published the Template for the Public Summary of Training Content for general-purpose AI models: a standardized format that providers will be required to complete and publish to summarize key information about the data used in training. From 2 August 2025, any general-purpose model placed on the market or distributed in the EU must be accompanied by this summary; models already on the market have until 2 August 2027 to adapt. This measure materializes the AI Act's principle of transparency and aims to shed light on the "black boxes" of AI.

In this article, we explain this template keys´s: from its objectives and structure, to information on deadlines, penalties, and next steps.

Objectives and relevance of the template

General-purpose AI models are trained on data from a wide variety of sources and modalities, such as:

Text: books, scientific articles, press, social networks.
Images and videos: digital content from the Internet and visual collections.
Audio: recordings, podcasts, radio programs, or conversations.
User data: information generated in interaction with the model itself or with other services of the provider.

This process of mass data collection is often opaque, raising concerns among rights holders, users, regulators, and society as a whole. Without transparency, it is difficult to assess whether data has been obtained lawfully, whether it includes unauthorised personal information or whether it adequately represents the cultural and linguistic diversity of the European Union.

Recital 107 of the AI Act states that the main objective of this template is to increase transparency and facilitate the exercise and protection of rights. Among the benefits it provides, the following stand out:

Intellectual property protection: allows authors, publishers and other rights holders to identify if their works have been used during training, facilitating the defense of their rights and a fair use of their content.
Privacy safeguard: helps detect whether personal data has been used, providing useful information so that affected individuals can exercise their rights under the General Data Protection Regulation (GDPR) and other regulations in the same field.
Prevention of bias and discrimination: provides information on the linguistic and cultural diversity of the sources used, key to assessing and mitigating biases that may lead to discrimination.
Fostering competition and research: reduces "black box" effects and facilitates academic scrutiny, while helping other companies better understand where data comes from, favoring more open and competitive markets.

In short, this template is not only a legal requirement, but a tool to build trust in artificial intelligence, creating an ecosystem in which technological innovation and the protection of rights are mutually reinforcing.

Template structure

The template, officially published on 24 July 2025 after a public consultation with more than 430 participating organisations, has been designed so that the information is presented in a clear, homogeneous and understandable way, both for specialists and for the public.

It consists of three main sections, ranging from basic model identification to legal aspects related to data processing.

1. General information

It provides a global view of the provider, the model, and the general characteristics of the training data:

Identification of the supplier, such as name and contact details.
Identification of the model and its versions, including dependencies if it is a modification (fine-tuning) of another model.
Date of placing the model on the market in the EU.
Data modalities used (text, image, audio, video, or others).
Approximate size of data by modality, expressed in wide ranges (e.g., less than 1 billion tokens, between 1 billion and 10 trillion, more than 10 trillion).
Language coverage, with special attention to the official languages of the European Union.

This section provides a level of detail sufficient to understand the extent and nature of the training, without revealing trade secrets.

2. List of data sources

It is the core of the template, where the origin of the training data is detailed. It is organized into six main categories, plus a residual category (other).

Public datasets:
- Data that is freely available and downloadable as a whole or in blocks (e.g., open data portals, common crawl, scholarly repositories).
- "Large" sets must be identified, defined as those that represent more than 3% of the total public data used in a specific modality.
Licensed private sets:
- Data obtained through commercial agreements with rights holders or their representatives, such as licenses with publishers for the use of digital books.
- A general description is provided only.
Other unlicensed private data:
- Databases acquired from third parties that do not directly manage copyright.
- If they are publicly known, they must be listed; otherwise, a general description (data type, nature, languages) is sufficient.
Data obtained through web crawling/scraping:
- Information collected by or on behalf of the supplier using automated tools.
- It must be specified:
  - Name/identifier of the trackers.
  - Purpose and behavior (respect for robots.txt, captchas, paywalls, etc.).
  - Collection period.
  - Types of websites (media, social networks, blogs, public portals, etc.).
  - List of most relevant domains, covering at least the top 10% by volume. For SMBs, this requirement is adjusted to 5% or a maximum of 1,000 domains, whichever is less.
Users data:
- Information generated through interaction with the model or with other provider services.
- It must indicate which services contribute and the modality of the data (text, image, audio, etc.).
Synthetic data:
- Data created by or for the supplier using other AI models (e.g., model distillation or reinforcement with human feedback - RLHF).
- Where appropriate, the generator model should be identified if it is available in the market.

Additional category – Other: Includes data that does not fit into the above categories, such as offline sources, self-digitization, manual tagging, or human generation.

3. Aspects of data processing

It focuses on how data has been handled before and during training, with a particular focus on legal compliance:

Respect for Text and Data Mining (TDM): measures taken to honour the right of exclusion provided for in Article 4(3) of Directive 2019/790 on copyright, which allows rightholders to prevent the mining of texts and data. This right is exercised through opt-out protocols, such as tags in files or configurations in robots.txt, that indicate that certain content cannot be used to train models. Vendors should explain how they have identified and respected these opt-outs in their own datasets and in those purchased from third parties.
Removal of illegal content: procedures used to prevent or debug content that is illegal under EU law, such as child sexual abuse material, terrorist content or serious intellectual property infringements. These mechanisms may include blacklisting, automatic classifiers, or human review, but without revealing trade secrets.

The following diagram summarizes these three sections:

Balancing transparency and trade secrets

The European Commission has designed the template seeking a delicate balance: offering sufficient information to protect rights and promote transparency, without forcing the disclosure of information that could compromise the competitiveness of suppliers.

Public sources: the highest level of detail is required, including names and links to "large" datasets.
Private sources: a more limited level of detail is allowed, through general descriptions when the information is not public.
Web scraping: a summary list of domains is required, without the need to detail exact combinations.
User and synthetic data: the information is limited to confirming its use and describing the modality.

Thanks to this approach, the summary is "generally complete" in scope, but not "technically detailed", protecting both transparency and the intellectual and commercial property of companies.

Compliance, deadlines and penalties

Article 53 of the AI Act details the obligations of general-purpose model providers, most notably the publication of this summary of training data.

This obligation is complemented by other measures, such as:

Have a public copyright policy.
Implement risk assessment and mitigation processes, especially for models that may generate systemic risks.
Establish mechanisms for traceability and supervision of data and training processes.

Non-compliance can lead to significant fines, up to €15 million or 3% of the company's annual global turnover, whichever is higher.

Next Steps for Suppliers

To adapt to this new obligation, providers should:

Review internal data collection and management processes to ensure that necessary information is available and verifiable.
Establish clear transparency and copyright policies, including protocols to respect the right of exclusion in text and data mining (TDM).
Publish the abstract on official channels before the corresponding deadline.
Update the summary periodically, at least every six months or when there are material changes in training.

The European Commission, through the European AI Office, will monitor compliance and may request corrections or impose sanctions.

A key tool for governing data

In our previous article, "Governing Data to Govern Artificial Intelligence", we highlighted that reliable AI is only possible if there is a solid governance of data.

This new template reinforces that principle, offering a standardized mechanism for describing the lifecycle of data, from source to processing, and encouraging interoperability and responsible reuse.

This is a decisive step towards a more transparent, fair and aligned AI with European values, where the protection of rights and technological innovation can advance together.

Conclusions

The publication of the Public Summary Template marks a historic milestone in the regulation of AI in Europe. By requiring providers to document and make public the data used in training, the European Union is taking a decisive step towards a more transparent and trustworthy artificial intelligence, based on responsibility and respect for fundamental rights. In a world where data is the engine of innovation, this tool becomes the key to governing data before governing AI, ensuring that technological development is built on trust and ethics.

Content created by Dr. Fernando Gualo, Professor at UCLM and Government and Data Quality Consultant. The content and views expressed in this publication are the sole responsibility of the author.

13/10/2025

How to build a Green Deal Data Space that complies with the FAIR principles

Blog

To achieve its environmental sustainability goals, Europe needs accurate, accessible and up-to-date information that enables evidence-based decision-making. The Green Deal Data Space (GDDS) will facilitate this transformation by integrating diverse data sources into a common, interoperable and open digital infrastructure.

In Europe, work is being done on its development through various projects, which have made it possible to obtain recommendations and good practices for its implementation. Discover them in this article!

What is the Green Deal Data Space?

The Green Deal Data Space (GDDS) is an initiative of the European Commission to create a digital ecosystem that brings together data from multiple sectors. It aims to support and accelerate the objectives of the Green Deal: the European Union's roadmap for a sustainable, climate-neutral and fair economy. The pillars of the Green Deal include:

An energy transition that reduces emissions and improves efficiency.
The promotion of the circular economy, promoting the recycling, reuse and repair of products to minimise waste.
The promotion of more sustainable agricultural practices.
Restoring nature and biodiversity, protecting natural habitats and reducing air, water and soil pollution.
The guarantee of social justice, through a transition that makes it easier for no country or community to be left behind.

Through this comprehensive strategy, the EU aims to become the world's first competitive and resource-efficient economy, achieving net-zero greenhouse gas emissions by 2050. The Green Deal Data Space is positioned as a key tool to achieve these objectives. Integrated into the European Data Strategy, data spaces are digital environments that enable the reliable exchange of data, while maintaining sovereignty and ensuring trust and security under a set of mutually agreed rules.

In this specific case, the GDDS will integrate valuable data on biodiversity, zero pollution, circular economy, climate change, forest services, smart mobility and environmental compliance. This data will be easy to locate, interoperable, accessible and reusable under the FAIR (Findability, Accessibility, Interoperability, Reusability) principles.

The GDDS will be implemented through the SAGE (Dataspace for a Green and Sustainable Europe) project and will be based on the results of the GREAT (Governance of Responsible Innovation) initiative.

A report with recommendations for the GDDS

How we saw in a previous article, four pioneering projects are laying the foundations for this ecosystem: AD4GD, B-Cubed, FAIRiCUBE and USAGE. These projects, funded under the HORIZON call, have analysed and documented for several years the requirements necessary to ensure that the GDDS follows the FAIR principles. As a result of this work, the report "Policy Brief: Unlocking The Full Potential Of The Green Deal Data Space”. It is a set of recommendations that seek to serve as a guide to the successful implementation of the Green Deal Data Space.

The report highlights five major areas in which the challenges of GDDS construction are concentrated:

1. Data harmonization

Environmental data is heterogeneous, as it comes from different sources: satellites, sensors, weather stations, biodiversity registers, private companies, research institutes, etc. Each provider uses its own formats, scales, and methodologies. This causes incompatibilities that make it difficult to compare and combine data. To fix this, it is essential to:

Adopt existing international standards and vocabularies, such as INSPIRE, that span multiple subject areas.
Avoid proprietary formats, prioritizing those that are open and well documented.
Invest in tools that allow data to be easily transformed from one format to another.

2. Semantic interoperability

Ensuring semantic interoperability is crucial so that data can be understood and reused across different contexts and disciplines, which is critical when sharing data between communities as diverse as those participating in the Green Deal objectives. In addition, the Data Act requires participants in data spaces to provide machine-readable descriptions of datasets, thus ensuring their location, access, and reuse. In addition, it requires that the vocabularies, taxonomies and lists of codes used be documented in a public and coherent manner. To achieve this, it is necessary to:

Use linked data and metadata that offer clear and shared concepts, through vocabularies, ontologies and standards such as those developed by the OGC or ISO standards.
Use existing standards to organize and describe data and only create new extensions when really necessary.
Improve the already accepted international vocabularies, giving them more precision and taking advantage of the fact that they are already widely used by scientific communities.

3. Metadata and data curation

Data only reaches its maximum value if it is accompanied by clear metadata explaining its origin, quality, restrictions on use and access conditions. However, poor metadata management remains a major barrier. In many cases, metadata is non-existent, incomplete, or poorly structured, and is often lost when translated between non-interoperable standards. To improve this situation, it is necessary to:

Extend existing metadata standards to include critical elements such as observations, measurements, source traceability, etc.
Foster interoperability between metadata standards in use, through mapping and transformation tools that respond to both commercial and open data needs.
Recognize and finance the creation and maintenance of metadata in European projects, incorporating the obligation to generate a standardized catalogue from the outset in data management plans.

4. Data Exchange and Federated Provisioning

The GDDS does not only seek to centralize all the information in a single repository, but also to allow multiple actors to share data in a federated and secure way. Therefore, it is necessary to strike a balance between open access and the protection of rights and privacy. This requires:

Adopt and promote open and easy-to-use technologies that allow the integration between open and protected data, complying with the General Data Protection Regulation (GDPR).
Ensure the integration of various APIs used by data providers and user communities, accompanied by clear demonstrators and guidelines. However, the use of standardized APIs needs to be promoted to facilitate a smoother implementation, such as OGC (Open Geospatial Consortium) APIs for geospatial assets.
Offer clear specification and conversion tools to enable interoperability between APIs and data formats.

In parallel to the development of the Eclipse Dataspace Connectors (an open-source technology to facilitate the creation of data spaces), it is proposed to explore alternatives such as blockchain catalogs or digital certificates, following examples such as the FACTS (Federated Agile Collaborative Trusted System).

5. Inclusive and sustainable governance

The success of the GDDS will depend on establishing a robust governance framework that ensures transparency, participation, and long-term sustainability. It is not only about technical standards, but also about fair and representative rules. To make progress in this regard, it is key to:

Use only European clouds to ensure data sovereignty, strengthen security and comply with EU regulations, something that is especially important in the face of today's global challenges.
Integrating open platforms such as Copernicus, the European Data Portal and INSPIRE into the GDDS strengthens interoperability and facilitates access to public data. In this regard, it is necessary to design effective strategies to attract open data providers and prevent GDDS from becoming a commercial or restricted environment.
Mandating data in publicly funded academic journals increases its visibility, and supporting standardization initiatives strengthens the visibility of data and ensures its long-term maintenance.
Providing comprehensive training and promoting cross-use of harmonization tools prevents the creation of new data silos and improves cross-domain collaboration.

The following image summarizes the relationship between these blocks:

Conclusion

All these recommendations have an impact on a central idea: building a Green Deal Data Space that complies with the FAIR principles is not only a technical issue, but also a strategic and ethical one. It requires cross-sector collaboration, political commitment, investment in capacities, and inclusive governance that ensures equity and sustainability. If Europe succeeds in consolidating this digital ecosystem, it will be better prepared to meet environmental challenges with informed, transparent and common good-oriented decisions.

01/10/2025

How to build a citizen science initiative considering open data from the start

Blog

Citizen participation in the collection of scientific data promotes a more democratic science, by involving society in R+D+i processes and reinforcing accountability. In this sense, there are a variety of citizen science initiatives launched by entities such as CSIC, CENEAM or CREAF, among others. In addition, there are currently numerous citizen science platform platforms that help anyone find, join and contribute to a wide variety of initiatives around the world, such as SciStarter.

Some references in national and European legislation

Different regulations, both at national and European level, highlight the importance of promoting citizen science projects as a fundamental component of open science. For example, Organic Law 2/2023, of 22 March, on the University System, establishes that universities will promote citizen science as a key instrument for generating shared knowledge and responding to social challenges, seeking not only to strengthen the link between science and society, but also to contribute to a more equitable, inclusive and sustainable territorial development.

On the other hand, Law 14/2011, of 1 June, on Science, Technology and Innovation, promotes "the participation of citizens in the scientific and technical process through, among other mechanisms, the definition of research agendas, the observation, collection and processing of data, the evaluation of impact in the selection of projects and the monitoring of results, and other processes of citizen participation."

At the European level, Regulation (EU) 2021/695 establishing the Framework Programme for Research and Innovation "Horizon Europe", indicates the opportunity to develop projects co-designed with citizens, endorsing citizen science as a research mechanism and a means of disseminating results.

Citizen science initiatives and data management plans

The first step in defining a citizen science initiative is usually to establish a research question that requires data collection that can be addressed with the collaboration of citizens. Then, an accessible protocol is designed for participants to collect or analyze data in a simple and reliable way (it could even be a gamified process). Training materials must be prepared and a means of participation (application, web or even paper) must be developed. It also plans how to communicate progress and results to citizens, encouraging their participation.

As it is an intensive activity in data collection, it is interesting that citizen science projects have a data management plan that defines the life cycle of data in research projects, that is, how data is created, organized, shared, reused and preserved in citizen science initiatives. However, most citizen science initiatives do not have such a plan: this recent research article found that only 38% of the citizen science projects consulted had a data management plan.

Figure 1. Data life cycle in citizen science projects Source: own elaboration – datos.gob.es.

On the other hand, data from citizen science only reach their full potential when they comply with the FAIR principles and are published in open access. In order to help have this data management plan that makes data from citizen science initiatives FAIR, it is necessary to have specific standards for citizen science such as PPSR Core.

Open Data for Citizen Science with the PPSR Core Standard

The publication of open data should be considered from the early stages of a citizen science project, incorporating the PPSR Core standard as a key piece. As we mentioned earlier, when research questions are formulated, in a citizen science initiative, a data management plan must be proposed that indicates what data to collect, in what format and with what metadata, as well as the needs for cleaning and quality assurance from the data collected by citizens. in addition to a publication schedule.

Then, it must be standardized with PPSR (Public Participation in Scientific Research) Core. PPSR Core is a set of data and metadata standards, specially designed to encourage citizen participation in scientific research processes. It has a three-layer architecture based on a Common Data Model (CDM). This CDM helps to organize in a coherent and connected way the information about citizen science projects, the related datasets and the observations that are part of them, in such a way that the CDM facilitates interoperability between citizen science platforms and scientific disciplines. This common model is structured in three main layers that allow the key elements of a citizen science project to be described in a structured and reusable way. The first is the Project Metadata Model (PMM), which collects the general information of the project, such as its objective, participating audience, location, duration, responsible persons, sources of funding or relevant links. Second, the Dataset Metadata Model (DMM) documents each dataset generated, detailing what type of information is collected, by what method, in what period, under what license and under what conditions of access. Finally, the Observation Data Model (ODM) focuses on each individual observation made by citizen science initiative participants, including the date and location of the observation and the result. It is interesting to note that this PPSR-Core layer model allows specific extensions to be added according to the scientific field, based on existing vocabularies such as Darwin Core (biodiversity) or ISO 19156 (sensor measurements). (ODM) focuses on each individual observation made by participants of the citizen science initiative, including the date and place of the observation and the outcome. It is interesting to note that this PPSR-Core layer model allows specific extensions to be added according to the scientific field, based on existing vocabularies such as Darwin Core (biodiversity) or ISO 19156 (sensor measurements).

Figure 2. PPSR CORE layering architecture. Source: own elaboration – datos.gob.es.

This separation allows a citizen science initiative to automatically federate the project file (PMM) with platforms such as SciStarter, share a dataset (DMM) with a institutional repository of open scientific data, such as those added in FECYT's RECOLECTA and, at the same time, send verified observations (ODMs) to a platform such as GBIF without redefining each field.

In addition, the use of PPSR Core provides a number of advantages for the management of the data of a citizen science initiative:

Greater interoperability: platforms such as SciStarter already exchange metadata using PMM, so duplication of information is avoided.
Multidisciplinary aggregation: ODM profiles allow datasets from different domains (e.g. air quality and health) to be united around common attributes, which is crucial for multidisciplinary studies.
Alignment with FAIR principles: The required fields of the DMM are useful for citizen science datasets to comply with the FAIR principles.

It should be noted that PPSR Core allows you to add context to datasets obtained in citizen science initiatives. It is a good practice to translate the content of the PMM into language understandable by citizens, as well as to obtain a data dictionary from the DMM (description of each field and unit) and the mechanisms for transforming each record from the MDG. Finally, initiatives to improve PPSR Core can be highlighted, for example, through a DCAT profile for citizen science.

Conclusions

Planning the publication of open data from the beginning of a citizen science project is key to ensuring the quality and interoperability of the data generated, facilitating its reuse and maximizing the scientific and social impact of the project. To this end, PPSR Core offers a level-based standard (PMM, DMM, ODM) that connects the data generated by citizen science with various platforms, promoting that this data complies with the FAIR principles and considering, in an integrated way, various scientific disciplines. With PPSR Core , every citizen observation is easily converted into open data on which the scientific community can continue to build knowledge for the benefit of society.

Jose Norberto Mazón, Professor of Computer Languages and Systems at the University of Alicante. The contents and views reflected in this publication are the sole responsibility of the author.

20/08/2025

Thinking Out Loud: Prompts to Simulate Human Reasoning with AI

Blog

In the usual search for tricks to make our prompts more effective, one of the most popular is the activation of the chain of thought. It consists of posing a multilevel problem and asking the AI system to solve it, but not by giving us the solution all at once, but by making visible step by step the logical line necessary to solve it. This feature is available in both paid and free AI systems, it's all about knowing how to activate it.

Originally, the reasoning string was one of many tests of semantic logic that developers put language models through. However, in 2022, Google Brain researchers demonstrated for the first time that providing examples of chained reasoning in the prompt could unlock greater problem-solving capabilities in models.

From this moment on, little by little, it has positioned itself as a useful technique to obtain better results from use, being very questioned at the same time from a technical point of view. Because what is really striking about this process is that language models do not think in a chain: they are only simulating human reasoning before us.

How to activate the reasoning chain

There are two possible ways to activate this process in the models: from a button provided by the tool itself, as in the case of DeepSeek with the "DeepThink" button that activates the R1 model:

Graphical User Interface, Application
AI-generated content may be incorrect.

Figure 1. DeepSeek with the "DeepThink" button that activates the R1 model.

Or, and this is the simplest and most common option, from the prompt itself. If we opt for this option, we can do it in two ways: only with the instruction (zero-shot prompting) or by providing solved examples (few-shot prompting).

Zero-shot prompting: as simple as adding at the end of the prompt an instruction such as "Reason step by step", or "Think before answering". This assures us that the chain of reasoning will be activated and we will see the logical process of the problem visible.

Graphical User Interface, Text, Application
AI-generated content may be incorrect.

Figure 2. Example of Zero-shot prompting.

Few-shot prompting: if we want a very precise response pattern, it may be interesting to provide some solved question-answer examples. The model sees this demonstration and imitates it as a pattern in a new question.

Text, Application, Letter
AI-generated content may be incorrect.

Figure 3. Example of Few-shot prompting.

Benefits and three practical examples

When we activate the chain of reasoning, we are asking the system to "show" its work in a visible way before our eyes, as if it were solving the problem on a blackboard. Although not completely eliminated, forcing the language model to express the logical steps reduces the possibility of errors, because the model focuses its attention on one step at a time. In addition, in the event of an error, it is much easier for the user of the system to detect it with the naked eye.

When is the chain of reasoning useful? Especially in mathematical calculations, logical problems, puzzles, ethical dilemmas or questions with different stages and jumps (called multi-hop). In the latter, it is practical, especially in those in which you have to handle information from the world that is not directly included in the question.

Let's see some examples in which we apply this technique to a chronological problem, a spatial problem and a probabilistic problem.

Chronological reasoning

Let's think about the following prompt:

If Juan was born in October and is 15 years old, how old was he in June of last year?

Graphical User Interface, Text, Application
AI-generated content may be incorrect.

Figure 5. Example of chronological reasoning.

For this example we have used the GPT-o3 model, available in the Plus version of ChatGPT and specialized in reasoning, so the chain of thought is activated as standard and it is not necessary to do it from the prompt. This model is programmed to give us the information of the time it has taken to solve the problem, in this case 6 seconds. Both the answer and the explanation are correct, and to arrive at them the model has had to incorporate external information such as the order of the months of the year, the knowledge of the current date to propose the temporal anchorage, or the idea that age changes in the month of the birthday, and not at the beginning of the year.

Spatial reasoning
A person is facing north. Turn 90 degrees to the right, then 180 degrees to the left. In what direction are you looking now?

Figure 6. Example of spatial reasoning.

This time we have used the free version of ChatGPT, which uses the GPT-4o model by default (although with limitations), so it is safer to activate the reasoning chain with an indication at the end of the prompt: Reason step by step. To solve this problem, the model needs general knowledge of the world that it has learned in training, such as the spatial orientation of the cardinal points, the degrees of rotation, laterality and the basic logic of movement.
Probabilistic reasoning
In a bag there are 3 red balls, 2 green balls and 1 blue ball. If you draw a ball at random without looking, what's the probability that it's neither red nor blue?

Figure 7. Example of probabilistic reasoning.

To launch this prompt we have used Gemini 2.5 Flash, in the Gemini Pro version of Google. The training of this model was certainly included in the fundamentals of both basic arithmetic and probability, but the most effective for the model to learn to solve this type of exercise are the millions of solved examples it has seen. Probability problems and their step-by-step solutions are the model to imitate when reconstructing this reasoning.

The Great Simulation

And now, let's go with the questioning. In recent months, the debate about whether or not we can trust these mock explanations has grown, especially since, ideally, the chain of thought should faithfully reflect the internal process by which the model arrives at its answer. And there is no practical guarantee that this will be the case.

The Anthropic team (creators of Claude, another great language model) has carried out a trap experiment with Claude Sonnet in 2025, to which they suggested a key clue for the solution before activating the reasoned response.

Think of it like passing a student a note that says "the answer is [A]" before an exam. If you write on your exam that you chose [A] at least in part because of the grade, that's good news: you're being honest and faithful. But if you write down what claims to be your reasoning process without mentioning the note, we might have a problem.

The percentage of times Claude Sonnet included the track among his deductions was only 25%. This shows that sometimes models generate explanations that sound convincing, but that do not correspond to their true internal logic to arrive at the solution, but are rationalizations a posteriori: first they find the solution, then they invent the process in a coherent way for the user. This shows the risk that the model may be hiding steps or relevant information for the resolution of the problem.

Closing

Despite the limitations exposed, as we see in the study mentioned above, we cannot forget that in the original Google Brain research, it was documented that, when applying the reasoning chain, the PaLM model improved its performance in mathematical problems from 17.9% to 58.1% accuracy. If, in addition, we combine this technique with the search in open data to obtain information external to the model, the reasoning improves in terms of being more verifiable, updated and robust.

However, by making language models "think out loud", what we are really improving in 100% of cases is the user experience in complex tasks. If we do not fall into the excessive delegation of thought to AI, our own cognitive process can benefit. It is also a technique that greatly facilitates our new work as supervisors of automatic processes.

Content prepared by Carmen Torrijos, expert in AI applied to language and communication. The contents and points of view reflected in this publication are the sole responsibility of the author.

08/07/2025

Evaluate to trust: the key role of validation and open data in generative AI

Blog

Generative artificial intelligence is beginning to find its way into everyday applications ranging from virtual agents (or teams of virtual agents) that resolve queries when we call a customer service centre to intelligent assistants that automatically draft meeting summaries or report proposals in office environments.

These applications, often governed by foundational language models (LLMs), promise to revolutionise entire industries on the basis of huge productivity gains. However, their adoption brings new challenges because, unlike traditional software, a generative AI model does not follow fixed rules written by humans, but its responses are based on statistical patterns learned from processing large volumes of data. This makes its behaviour less predictable and more difficult to explain, and sometimes leads to unexpected results, errors that are difficult to foresee, or responses that do not always align with the original intentions of the system's creator.

Therefore, the validation of these applications from multiple perspectives such as ethics, security or consistency is essential to ensure confidence in the results of the systems we are creating in this new stage of digital transformation.

What needs to be validated in generative AI-based systems?

Validating generative AI-based systems means rigorously checking that they meet certain quality and accountability criteria before relying on them to solve sensitive tasks.

It is not only about verifying that they ‘work’, but also about making sure that they behave as expected, avoiding biases, protecting users, maintaining their stability over time, and complying with applicable ethical and legal standards. The need for comprehensive validation is a growing consensus among experts, researchers, regulators and industry: deploying AI reliably requires explicit standards, assessments and controls.

We summarize four key dimensions that need to be checked in generative AI-based systems to align their results with human expectations:

Ethics and fairness: a model must respect basic ethical principles and avoid harming individuals or groups. This involves detecting and mitigating biases in their responses so as not to perpetuate stereotypes or discrimination. It also requires filtering toxic or offensive content that could harm users. Equity is assessed by ensuring that the system offers consistent treatment to different demographics, without unduly favouring or excluding anyone.
Security and robustness: here we refer to both user safety (that the system does not generate dangerous recommendations or facilitate illicit activities) and technical robustness against errors and manipulations. A safe model must avoid instructions that lead, for example, to illegal behavior, reliably rejecting those requests. In addition, robustness means that the system can withstand adversarial attacks (such as requests designed to deceive you) and that it operates stably under different conditions.
Consistency and reliability: Generative AI results must be consistent, consistent, and correct. In applications such as medical diagnosis or legal assistance, it is not enough for the answer to sound convincing; it must be true and accurate. For this reason, aspects such as the logical coherence of the answers, their relevance with respect to the question asked and the factual accuracy of the information are validated. Its stability over time is also checked (that in the face of two similar requests equivalent results are offered under the same conditions) and its resilience (that small changes in the input do not cause substantially different outputs).
Transparency and explainability: To trust the decisions of an AI-based system, it is desirable to understand how and why it produces them. Transparency includes providing information about training data, known limitations, and model performance across different tests. Many companies are adopting the practice of publishing "model cards," which summarize how a system was designed and evaluated, including bias metrics, common errors, and recommended use cases. Explainability goes a step further and seeks to ensure that the model offers, when possible, understandable explanations of its results (for example, highlighting which data influenced a certain recommendation). Greater transparency and explainability increase accountability, allowing developers and third parties to audit the behavior of the system.

Open data: transparency and more diverse evidence

Properly validating AI models and systems, particularly in terms of fairness and robustness, requires representative and diverse datasets that reflect the reality of different populations and scenarios.

On the other hand, if only the companies that own a system have data to test it, we have to rely on their own internal evaluations. However, when open datasets and public testing standards exist, the community (universities, regulators, independent developers, etc.) can test the systems autonomously, thus functioning as an independent counterweight that serves the interests of society.

A concrete example was given by Meta (Facebook) when it released its Casual Conversations v2 dataset in 2023. It is an open dataset, obtained with informed consent, that collects videos from people from 7 countries (Brazil, India, Indonesia, Mexico, Vietnam, the Philippines and the USA), with 5,567 participants who provided attributes such as age, gender, language and skin tone.

Meta's objective with the publication was precisely to make it easier for researchers to evaluate the impartiality and robustness of AI systems in vision and voice recognition. By expanding the geographic provenance of the data beyond the U.S., this resource allows you to check if, for example, a facial recognition model works equally well with faces of different ethnicities, or if a voice assistant understands accents from different regions.

The diversity that open data brings also helps to uncover neglected areas in AI assessment. Researchers from Stanford's Human-Centered Artificial Intelligence (HAI) showed in the HELM (Holistic Evaluation of Language Models) project that many language models are not evaluated in minority dialects of English or in underrepresented languages, simply because there are no quality data in the most well-known benchmarks.

The community can identify these gaps and create new test sets to fill them (e.g., an open dataset of FAQs in Swahili to validate the behavior of a multilingual chatbot). In this sense, HELM has incorporated broader evaluations precisely thanks to the availability of open data, making it possible to measure not only the performance of the models in common tasks, but also their behavior in other linguistic, cultural and social contexts. This has contributed to making visible the current limitations of the models and to promoting the development of more inclusive and representative systems of the real world or models more adapted to the specific needs of local contexts, as is the case of the ALIA foundational model, developed in Spain.

In short, open data contributes to democratizing the ability to audit AI systems, preventing the power of validation from residing only in a few. They allow you to reduce costs and barriers as a small development team can test your model with open sets without having to invest great efforts in collecting their own data. This not only fosters innovation, but also ensures that local AI solutions from small businesses are also subject to rigorous validation standards.

The validation of applications based on generative AI is today an unquestionable necessity to ensure that these tools operate in tune with our values and expectations. It is not a trivial process, it requires new methodologies, innovative metrics and, above all, a culture of responsibility around AI. But the benefits are clear, a rigorously validated AI system will be more trustworthy, both for the individual user who, for example, interacts with a chatbot without fear of receiving a toxic response, and for society as a whole who can accept decisions based on these technologies knowing that they have been properly audited. And open data helps to cement this trust by fostering transparency, enriching evidence with diversity, and involving the entire community in the validation of AI systems.

Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization. The contents and views reflected in this publication are the sole responsibility of the author.

20/06/2025

Sustainable artificial intelligence: how to minimise the environmental impact of AI

Blog

Artificial intelligence (AI) has become a key technology in multiple sectors, from health and education to industry and environmental management, not to mention the number of citizens who create texts, images or videos with this technology for their own personal enjoyment. It is estimated that in Spain more than half of the adult population has ever used an AI tool.

However, this boom poses challenges in terms of sustainability, both in terms of water and energy consumption and in terms of social and ethical impact. It is therefore necessary to seek solutions that help mitigate the negative effects, promoting efficient, responsible and accessible models for all. In this article we will address this challenge, as well as possible efforts to address it.

What is the environmental impact of AI?

In a landscape where artificial intelligence is all the rage, more and more users are wondering what price we should pay for being able to create memes in a matter of seconds.

To properly calculate the total impact of artificial intelligence, it is necessary to consider the cycles of hardware and software as a whole, as the United Nations Environment Programme (UNEP)indicates. That is, it is necessary to consider everything from raw material extraction, production, transport and construction of the data centre, management, maintenance and disposal of e-waste, to data collection and preparation, modelling, training, validation, implementation, inference, maintenance and decommissioning. This generates direct, indirect and higher-order effects:

The direct impacts include the consumption of energy, water and mineral resources, as well as the production of emissions and e-waste, which generates a considerable carbon footprint.
The indirect effects derive from the use of AI, for example, those generated by the increased use of autonomous vehicles.
Moreover, the widespread use of artificial intelligence also carries an ethical dimension, as it may exacerbate existing inequalities, especially affecting minorities and low-income people. Sometimes the training data used are biased or of poor quality (e.g. under-representing certain population groups). This situation can lead to responses and decisions that favour majority groups.

Some of the figures compiled in the UN document that can help us to get an idea of the impact generated by AI include:

A single request for information to ChatGPT consumes ten times more electricity than a query on a search engine such as Google, according to data from the International Energy Agency (IEA).
Entering a single Large Language Model ( Large Language Models or LLM) generates approximately 300.000 kg of carbon dioxide emissions, which is equivalent to 125 round-trip flights between New York and Beijing, according to the scientific paper "The carbon impact of artificial intelligence".
Global demand for AI water will be between 4.2 and 6.6 billion cubic metres by 2,027, a figure that exceeds the total consumption of a country like Denmark, according to the "Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models" study.

Solutions for sustainable AI

In view of this situation, the UN itself proposes several aspects to which attention needs to be paid, for example:

Search for standardised methods and parameters to measure the environmental impact of AI, focusing on direct effects, which are easier to measure thanks to energy, water and resource consumption data. Knowing this information will make it easier to take action that will bring substantial benefit.
Facilitate the awareness of society, through mechanisms that oblige companies to make this information public in a transparent and accessible manner. This could eventually promote behavioural changes towards a more sustainable use of AI.
Prioritise research on optimising algorithms, for energy efficiency. For example, the energy required can be minimised by reducing computational complexity and data usage. Decentralised computing can also be boosted, as distributing processes over less demanding networks avoids overloading large servers.
Encourage the use of renewable energies in data centres, such as solar and wind power. In addition, companies need to be encouraged to undertake carbon offsetting practices.

In addition to its environmental impact, and as seen above, AI must also be sustainable from a social and ethical perspective. This requires:

Avoid algorithmic bias: ensure that the data used represent the diversity of the population, avoiding unintended discrimination.
Transparency in models: make algorithms understandable and accessible, promoting trust and human oversight.
Accessibility and equity: develop AI systems that are inclusive and benefit underprivileged communities.

While artificial intelligence poses challenges in terms of sustainability, it can also be a key partner in building a greener planet. Its ability to analyse large volumes of data allows optimising energy use, improving the management of natural resources and developing more efficient strategies in sectors such as agriculture, mobility and industry. From predicting climate change to designing models to reduce emissions, AI offers innovative solutions that can accelerate the transition to a more sustainable future.

National Green Algorithms Programme

In response to this reality, Spain has launched the National Programme for Green Algorithms (PNAV). This is an initiative that seeks to integrate sustainability in the design and application of AI, promoting more efficient and environmentally responsible models, while promoting its use to respond to different environmental challenges.

The main goal of the NAPAV is to encourage the development of algorithms that minimise environmental impact from their conception. This approach, known as "Green by Design", implies that sustainability is not an afterthought, but a fundamental criterion in the creation of AI models. In addition, the programme seeks to promote research in sustainable IA, improve the energy efficiency of digital infrastructures and promote the integration of technologies such as the green blockchain into the productive fabric.

This initiative is part of the Recovery, Transformation and Resilience Plan, the Spain Digital Agenda 2026 and the National Artificial Intelligence Strategy.. Objectives include the development of a best practice guide, a catalogue of efficient algorithms and a catalogue of algorithms to address environmental problems, the generation of an impact calculator for self-assessment, as well as measures to support awareness-raising and training of AI developers.

Its website functions as a knowledge space on sustainable artificial intelligence, where you can keep up to date with the main news, events, interviews, etc. related to this field. They also organise competitions, such as hackathons, to promote solutions that help solve environmental challenges.

The Future of Sustainable AI

The path towards more responsible artificial intelligence depends on the joint efforts of governments, business and the scientific community. Investment in research, the development of appropriate regulations and awareness of ethical AI will be key to ensuring that this technology drives progress without compromising the planet or society.

Sustainable AI is not only a technological challenge, but an opportunity to transform innovation into a driver of global welfare. It is up to all of us to progress as a society without destroying the planet.

12/06/2025

The importance of licences in the digital environment: an approach accessible to all

Blog

In an increasingly digitised world, the creation, use and distribution of software and data have become essential activities for individuals, businesses and government organisations. However, behind these everyday practices lies a crucial aspect: licensingof both software and data.

Understanding what licences are, their types and their importance is essential to ensure legal and ethical use of digital resources. In this article, we will explore these concepts in a simple and accessible way, as well as discuss a valuable tool called Joinup Licensing Assistant, developed by the European Union.

What are licences and why are they important?

A licence is a legal agreement that grants specific permissions on the use of a digital product, be it software, data, multimedia content or other resources. This agreement sets out the conditions under which the product may be used, modified, distributed or marketed. Licences are essential because they protect the rights of creators, ensure that users understand their rights and obligations, and foster a safe and collaborative digital environment.

The following are some examples of the most popular ones, both for data and software.

Common types of licences

Copyright

Copyright is an automatic protection which arises at the moment of the creation of an original work, be it literary, artistic or scientific. It is not necessary to formally register the work in order for it to be protected by copyright. This right grants the creator exclusive rights over the reproduction, distribution, public communication and transformation of his work.

Ejemplo: When a company creates a dataset on, for example, construction trends, it automatically owns the copyright on that data. This means that others may not use, modify or distribute such data without the explicit permission of the creator.

Public domain

When a work is not protected by copyright, it is considered to be in the public domain. This may occur because the rights have expired, the author has waived them or because the work does not meet the legal requirements for protection. For example, a work that lacks sufficient originality - such as a telephone list or a standard form - does not qualify for protection. Works in the public domain may be used freely by anyone, without the need to obtain permission.

Ejemplo: Many classic works of literature, such as those of William Shakespeare, are in the public domain and can be freely reproduced and adapted.

Creative commons

The Creative Commons licences offer aflexible way to grant permissions for the use of copyrighted works. These licences allow creators to specify which uses they do and do not allow, facilitating the dissemination and re-use of their works under clear conditions. The most common CC licences include:

CC BY (Attribution): permits the use, distribution and creation of derivative works, provided credit is given to the original author.
CC BY-SA (Attribution-Share Alike): in addition to attribution, requires that derivative works be distributed under the same licence.
CC BY-ND (Attribution-No Derivative Works): permits redistribution, commercial and non-commercial, provided the work remains intact and credit is given to the author.
CC0 (Public Domain): allows creators to waive all rights to their works, allowing them to be used freely without attribution.

These licences are especially useful for creators who wish to share their works while retaining certain rights over their use.

GNU General Public License (GPL)

The GNU General Public License (GPL) , created by the Free Software Foundation, guarantees that software licensed under its terms will always remain free and accessible to everyone. This licence is specifically designed for software, not data. It aims to ensure that the software remains free, accessible and modifiable by any user, protecting the freedoms related to its use and distribution.

This licence not only allows users to use, modify and distribute the software, but also requires that any derivative works retain the same terms of freedom. In other words, any software that is distributed or modified under the GPL must remain free for all its users. The GPL is designed to protect four essential freedoms:

The freedom to use the software for any purpose.
The freedom to study how the software works and adapt it to specific needs.
The freedom to distribute copies of the software to help others.
The freedom to improve the software and release the improvements for the benefit of the community.

One of the key features of the GPL is its "copyleft" clause, which requires that any derivative works be licensed under the same terms as the original software. This prevents free software from becoming proprietary and ensures that the original freedoms remain intact.

Ejemplo: Suppose a company develops a programme under the GPL and distributes it to its customers. If any of these customers decide to modify the source code to suit their needs, it is their right to do so. In addition, if the company or customer wishes to redistribute modified versions of the software, they must do so under the same GPL licence, ensuring that any new user also enjoys the original freedoms.

European Union Public Licence (EUPL)

The European Union Public License (EUPL) is a free and open source software licence developed by the European Commission. Designed to facilitate interoperability and cooperation between Europeansoftware, the EUPL allows the free use, modification and distribution of software, ensuring that derivative works are also kept open. In addition to covering software, the EUPL can be applied to ancillary documents such as specifications, user manuals and technical documentation.

Although the EUPL is used for software, in some cases it may be applicable to datasets or content (such as text, graphics, images, documentation or any other material not considered software or structured data),but its use in open data is less common than other specific licences such as Creative Commons or Open Data Commons.

Open Data Commons (ODC-BY)

The Open Data Commons Attribution License (ODC-BY) is a licence designed specifically for databases and datasets, developed by the Open Knowledge Foundation. It aims to allow free use of data, while requiring appropriate acknowledgement of the original creator. This licence is not designed for software, but for structured data, such as statistics, open catalogues or geospatial maps.

ODC-BY allows users to:

Copy, Distribute and use the database.
Create derivative works, such as visualisations, analyses or derivative products.
Adapt data to new needs or combine them with other sources.

The only main condition is attribution: users must credit the original creator appropriately, including clear references to the source.

A notable feature of the ODC-BY is that does not impose a copyleft clause, meaning that derived data can be licensed under other terms, as long as attribution is maintained.

Ejemplo: Imagine that a city publishes its bicycle station database under ODC-BY. A company can download this data, create an app that recommends cycling routes and add new layers of information. As long as you clearly indicate that the original data comes from the municipality, you can offer your app under any licence you wish, even on a commercial basis.

A comparison of these most commonly used licences allows us to better understand their differences:

Licence	Allows commercial use	Permitted modification	Requires attribution	Allos derivative works	Applicable to data	Specialisationsnn
Copyright	Yes, with permission of the author	No, except by agreement with the creator	No	No	It can be applied to databases, but only if they meet certain requirements of creativity and originality in their structure or selection of content. It does not protect the data itself, but the way it is organised or presented.	Original works such as texts, music, films, software and, in some cases, databases whose structure or selection is creative. It does not protect the data itself.
Public domain	Yes	Yes	No	Yes	Yes	Original works such as texts, music, films and software without copyright protection (by expiration, waiver, or legal exclusion)
Creative Commons BY (Attribution)	Yes	Yes, with attribution	Yes	Yes	Yes	Reusable text, images, videos, infographics, web content and datasets, provided that authorship is acknowledged
Creative Commons BY-SA (Attribution-ShareAlike)	Yes	Yes, you must keep the same licence	Yes	Yes, with the same licence	Yes	Collaborative content such as articles, maps, datasets or open educational resources; ideal for community projects
Creative Commons BY-ND (Attribution-NoDerivatives)	Yes	No	Yes	No	Yes, but it is forbidden to modify or combine the data.	Content to be preserved unaltered: official documents, closed infographics, unalterable data sets, etc.
Creative Commons CC0 (Public domain)	Yes	Yes	No	Yes	Yes	All kinds of works: texts, images, music, data, software, etc., which are voluntarily released into the public domain.
GNU General Public License (GPL)	Yes	Yes, it should be kept under the GPL	Yes	Yes	No	Executable software or source code. Not suitable for documentation, multimedia content or databases.
*European Union Public Licence (EUPL)*	Yes	Yes, derivative works should remain open	Yes	Yes	Partially: could be used for technical data, but is not its main purpose	Software developed by public administrations and its associated technical documentation (manuals, specifications, etc.).
Open Data Commons (ODC-BY)	Yes	Yes	Yes	Yes	Yes (specifically designed for open data)	Structured databases such as public statistics, geospatial arrays, open catalogues or administrative registers

Figure 1. Comparative table. Source: own elaboration

Why is it necessary to use licences in the field of open data?

In the field of open data, these licences are essential to ensure that data is available for public use, promoting transparency, innovation and the development of data-driven solutions. In general, the advantages of using clear licences are:

Transparency and open access: clear licences allow citizens, researchers and developers to access and use public data without undue restrictions, fostering government transparency and accountability.
Fostering innovation: By enabling the free use of data, open data licences facilitate the creation of applications, services and analytics that can generate economic and social value.
Collaboration and reuse: licences that allow for the reuse and modification of data encourage collaboration between different entities and disciplines, fostering the development of more robust and complete solutions.
Improved data quality: The availability of open data encourages greater community participation and review, which can lead to an improvement in the quality and accuracy of the data available.
Legal certainty for the re-user: Clear licences provide confidence and certainty to those who re-use data, as they know they can do so legally and without fear of future conflicts.

Introduction to the Joinup Licensing Assistant?

In this complex licensing landscape, choosing the right one can be a daunting task, especially for those with no previous experience in licence management. This is where the Joinup Licensing Assistant, a tool developed by the European Union and available at Joinup.europa.eu, comes in. This collaborative platform is designed to promote the exchange of solutions and best practices between public administrations, companies and citizens, and the Licensing Assistant is one of its star tools.

For those working specifically with data, you may also find useful the report published by data.europa.eu, which provides more detailed recommendations on the selection of licences for open datasets in the European context.

The Joinup Licensing Assistant offers several features and benefits that simplify licence selection and management:

	Functionality		Benefits
	Customised advice: recommends suitable licences according to the type of project and your needs.		Simplifying the selection process: breaks down the choice of licence into clear steps, reducing complexity and time.
	Licence database: access to software licences, content and data, with clear descriptions.		Legal risk reduction: avoids legal problems by providing recommendations that are compatible with project requirements.
	Comparison of licences: allows you to easily see the differences between various licences.		Fostering collaboration and knowledge sharing: facilitates the exchange of experiences between users and public administrations.
	Legal update: provides information that is always up to date with current legislation.		Accessibility and usability: intuitive interface, useful even for those with no legal knowledge.
	Open data support: includes specific options to promote reuse and transparency.		Supporting the sustainability of free software and open data: promotes licences that drive innovation, openness and continuity of projects.

Figure 2. Table of functionality and benefits. Source: own elaboration

Various sectors can benefit from the use of the Joinup Licensing Assistant:.

Public administrations: to apply correct licences on software, content and open data, complying with European standards and encouraging re-use.
Software developers: to align licences with their business models and facilitate distribution and collaboration.
Content creators: to protect their rights and decide how their work can be used and shared.
Researchers and scientists: to publish reusable data to drive collaboration and scientific advances.

Conclusion

In an increasingly interconnected and regulated digital environment, using appropriate licences for software, content and especially open data is essential to ensure the legality, sustainability and impact of digital projects. Proper licence management facilitates collaboration, reuse and secure dissemination of resources, while reducing legal risks and promoting interoperability.

In this context, tools such as the Joinup Licensing Assistant offer valuable support for public administrations, companies and citizens, simplifying the choice of licences and adapting it to each case. Their use contributes to creating a more open, secure and efficient digital ecosystem.

Particularly in the field of open data, clear licences make data truly accessible and reusable, fostering institutional transparency, technological innovation and the creation of social value.

Content prepared by Mayte Toscano, Senior Consultant in Data Economy Technologies. The contents and points of view reflected in this publication are the sole responsibility of the author.

21/05/2025

Guidance for the deployment of data portals. Good practices and recommendations

Documentación

Open data portals help municipalities to offer structured and transparent access to the data they generate in the exercise of their functions and in the provision of the services they are responsible for, while also fostering the creation of applications, services and solutions that generate value for citizens, businesses and public administrations themselves.

The report aims to provide a practical guide for municipal administrations to design, develop and maintain effective open data portals, integrating them into the overall smart city strategy. The document is structured in several sections ranging from strategic planning to technical and operational recommendations necessary for the creation and maintenance of open data portals. Some of the main keys are:

Fundamental principles

The report highlights the importance of integrating open data portals into municipal strategic plans, aligning portal objectives with local priorities and citizens' expectations. It also recommends drawing up a Plan of measures for the promotion of openness and re-use of data (RISP Plan in Spanish acronyms), including governance models, clear licences, an open data agenda and actions to stimulate re-use of data. Finally, it emphasises the need for trained staff in strategic, technical and functional areas, capable of managing, maintaining and promoting the reuse of open data.

General requirements

In terms of general requirements to ensure the success of the portal, the importance of offering quality data, consistent and updated in open formats such as CSV and JSON, but also in XLS, favouring interoperability with national and international platforms through open standards such as DCAT-AP, and guaranteeing effective accessibility of the portal through an intuitive and inclusive design, adapted to different devices. It also points out the obligation to strictly comply with privacy and data protection regulations, especially the General Data Protection Regulation (GDPR).

To promote re-use, the report advises fostering dynamic ecosystems through community events such as hackathons and workshops, highlighting successful examples of practical application of open data. Furthermore, it insists on the need to provide useful tools such as APIs for dynamic queries, interactive data visualisations and full documentation, as well as to implement sustainable funding and maintenance mechanisms.

Technical and functional guidelines

Regarding technical and functional guidelines, the document details the importance of building a robust and scalable technical infrastructure based on cloud technologies, using diverse storage systems such as relational databases, NoSQL and specific solutions for time series or geospatial data. It also highlights the importance of integrating advanced automation tools to ensure consistent data quality and recommends specific solutions to manage real-time data from the Internet of Things (IoT).

In relation to the usability and structure of the portal, the importance of a user-centred design is emphasised, with clear navigation and a powerful search engine to facilitate quick access to data. Furthermore, it stresses the importance of complying with international accessibility standards and providing tools that simplify interaction with data, including clear graphical displays and efficient technical support mechanisms.

The report also highlights the key role of APIs as fundamental tools to facilitate automated and dynamic access to portal data, offering granular queries, clear documentation, robust security mechanisms and reusable standard formats. It also suggests a variety of tools and technical frameworks to implement these APIs efficiently.

Another critical aspect highlighted in the document is the identification and prioritisation of datasets for publication, as the progressive planning of data openness allows adjusting technical and organisational processes in an agile way, starting with the data of greatest strategic relevance and citizen demand.

Finally, the guide recommends establishing a system of metrics and indicators according to the UNE 178301:2015 standard to assess the degree of maturity and the real impact of open data portals. These metrics span strategic, legal, organisational, technical, economic and social domains, providing a holistic approach to measure both the effectiveness of data publication and its tangible impact on society and the local economy.

Conclusions

In conclusion, the report provides a strategic, technical and practical framework that serves as a reference for the deployment of municipal open data portals for cities to maximise their potential as drivers of economic and social development. In addition, the integration of artificial intelligence at various points in open data portal projects represents a strategic opportunity to expand their capabilities and generate a greater impact on citizens.

You can read the full report here.

03/04/2025

The future of the data commons: balancing opportunities and challenges

Blog

The concept of data commons emerges as a transformative approach to the management and sharing of data that serves collective purposes and as an alternative to the growing number of macrosilos of data for private use. By treating data as a shared resource, data commons facilitate collaboration, innovation and equitable access to data, emphasising the communal value of data above all other considerations. As we navigate the complexities of the digital age - currently marked by rapid advances in artificial intelligence (AI) and the continuing debate about the challenges in data governance- the role that data commons can play is now probably more important than ever.

What are data commons?

The data commons refers to a cooperative framework where data is collected, governed and shared among all community participants through protocols that promote openness, equity, ethical use and sustainability. The data commons differ from traditional data-sharing models mainly in the priority given to collaboration and inclusion over unitary control.

Another common goal of the data commons is the creation of collective knowledge that can be used by anyone for the good of society. This makes them particularly useful in addressing today's major challenges, such as environmental challenges, multilingual interaction, mobility, humanitarian catastrophes, preservation of knowledge or new challenges in health and health care.

In addition, it is also increasingly common for these data sharing initiatives to incorporate all kinds of tools to facilitate data analysis and interpretation , thus democratising not only the ownership of and access to data, but also its use.

For all these reasons, data commons could be considered today as a criticalpublic digital infrastructure for harnessing data and promoting social welfare.

Principles of the data commons

The data commons are built on a number of simple principles that will be key to their proper governance:

Openness and accessibility: data must be accessible to all authorised persons.
Ethical governance: balance between inclusion and privacy.
Sustainability: establish mechanisms for funding and resources to maintain data as a commons over time.
Collaboration: encourage participants to contribute new data and ideas that enable their use for mutual benefit.
Trust: relationships based on transparency and credibility between stakeholders.

In addition, if we also want to ensure that the data commons fulfil their role as public domain digital infrastructure, we must guarantee other additional minimum requirements such as: existence of permanent unique identifiers , documented metadata , easy access through application programming interfaces (APIs), portability of the data, data sharing agreements between peers and ability to perform operations on the data.

The important role of the data commons in the age of Artificial Intelligence

AI-driven innovation has exponentially increased the demand for high-quality, diverse data sets a relatively scarce commodityat a large scale that may lead to bottlenecks in the future development of the technology and, at the same time, makes data commons a very relevant enabler for a more equitable AI. By providing shared datasets governed by ethical principles, data commons help mitigate common risks such as risks, data monopolies and unequal access to the benefits of AI.

Moreover, the current concentration of AI developments also represents a challenge for the public interest. In this context, the data commons hold the key to enable a set of alternative, public and general interest-oriented AI systems and applications, which can contribute to rebalancing this current concentration of power. The aim of these models would be to demonstrate how more democratic, public interest-oriented and purposeful systems can be designed based on public AI governance principles and models.

However, the era of generative AI also presents new challenges for data commons such as, for example and perhaps most prominently, the potential risk of uncontrolled exploitation of shared datasets that could give rise to new ethical challenges due to data misuse and privacy violations.

On the other hand, the lack of transparency regarding the use of the data commons by the AI could also end up demotivating the communities that manage them, putting their continuity at risk. This is due to concerns that in the end their contribution may be benefiting mainly the large technology platforms, without any guarantee of a fairer sharing of the value and impact generated as originally intended".

For all of the above, organisations such as Open Future have been advocating for several years now for Artificial Intelligence to function as a common good, managed and developed as a digital public infrastructure for the benefit of all, avoiding the concentration of power and promoting equity and transparency in both its development and its application.

To this end, they propose a set of principles to guide the governance of the data commons in its application for AI training so as to maximise the value generated for society and minimise the possibilities of potential abuse by commercial interests:

Share as much data as possible, while maintaining such restrictions as may be necessary to preserve individual and collective rights.
Be fully transparent and provide all existing documentation on the data, as well as on its use, and clearly distinguish between real and synthetic data.
Respect decisions made about the use of data by persons who have previously contributed to the creation of the data, either through the transfer of their own data or through the development of new content, including respect for any existing legal framework.
Protect the common benefit in the use of data and a sustainable use of data in order to ensure proper governance over time, always recognising its relational and collective nature.
Ensuring the quality of data, which is critical to preserving its value as a common good, especially given the potential risks of contamination associated with its use by AI.
Establish trusted institutions that are responsible for data governance and facilitate participation by the entire data community, thus going a step beyond the existing models for data intermediaries.

Use cases and applications

There are currently many real-world examples that help illustrate the transformative potential of data commons:

Health data commons : projects such as the National Institutes of Health's initiative in the United States - NIH Common Fund to analyse and share large biomedical datasets, or the National Cancer Institute's Cancer Research Data Commons , demonstrate how data commons can contribute to the acceleration of health research and innovation.
AI training and machine learning: the evaluation of AI systems depends on rigorous and standardised test data sets. Initiatives such as OpenML or MLCommons build open, large-scale and diverse datasets, helping the wider community to deliver more accurate and secure AI systems.
Urban and mobility data commons : cities that take advantage of shared urban data platforms improve decision-making and public services through collective data analysis, as is the case of Barcelona Dades, which in addition to a large repository of open data integrates and disseminates data and analysis on the demographic, economic, social and political evolution of the city. Other initiatives such as OpenStreetMaps itself can also contribute to providing freely accessible geographic data.
Culture and knowledge preservation: with such relevant initiatives in this field as Mozilla's Common Voice project to preserve and revitalise the world's languages, or Wikidata, which aims to provide structured access to all data from Wikimedia projects, including the popular Wikipedia.

Challenges in the data commons

Despite their promise and potential as a transformative tool for new challenges in the digital age, the data commons also face their own challenges:

Complexity in governance: Striking the right balance between inclusion, control and privacy can be a delicate task.
Sustainability: Many of the existing data commons are fighting an ongoing battle to try to secure the funding and resources they need to sustain themselves and ensure their long-term survival.
Legal and ethical issues: addressing challenges relating to intellectual property rights, data ownership and ethical use remain critical issues that have yet to be fully resolved.
Interoperability: Ensuring compatibility between datasets and platforms is a persistent technical hurdle in almost any data sharing initiative, and the data commons were to be no exception.

The way forward

To unlock their full potential, the data commons require collective action and a determined commitment to innovation. Key actions include:

Develop standardised governance models that strike a balance between ethical considerations and technical requirements.
Apply the principle of reciprocity in the use of data, requiring those who benefit from it to share their results back with the community.
Protection of sensitive data through anonymisation, preventing data from being used for mass surveillance or discrimination.
Encourage investment in infrastructure to support scalable and sustainable data exchange.
Promote awareness of the social benefits of data commons to encourage participation and collaboration.

Policy makers, researchers and civil society organisations should work together to create an ecosystem in which the data commons can thrive, fostering more equitable growth in the digital economy and ensuring that the data commons can benefit all.

Conclusion

The data commons can be a powerful tool for democratising access to data and fostering innovation. In this era defined by AI and digital transformation, they offer us an alternative path to equitable, sustainable and inclusive progress. Addressing its challenges and adopting a collaborative governance approach through cooperation between communities, researchers and regulators will ensure fair and responsible use of data.

This will ensure that data commons become a fundamental pillar of the digital future, including new applications of Artificial Intelligence, and could also serve as a key enabling tool for some of the key actions that are part of the recently announced European competitiveness compass, such as the new Data Union strategy and the AI Gigafactories initiative.

Content prepared by Carlos Iglesias, Open data Researcher and consultant, World Wide Web Foundation. The contents and views expressed in this publication are the sole responsibility of the author.

13/02/2025

1. Ask expansive questions

Practical situation:

2. Ask for structured content

Practical situation:

3. Connect the dots

Practical situation:

Objectives and relevance of the template

Template structure

1. General information

2. List of data sources

3. Aspects of data processing

Balancing transparency and trade secrets

Compliance, deadlines and penalties

Next Steps for Suppliers

A key tool for governing data

Conclusions

What is the Green Deal Data Space?

A report with recommendations for the GDDS

1. Data harmonization

2. Semantic interoperability

3. Metadata and data curation

4. Data Exchange and Federated Provisioning

5. Inclusive and sustainable governance

Conclusion

Some references in national and European legislation

Citizen science initiatives and data management plans

Open Data for Citizen Science with the PPSR Core Standard

Conclusions

How to activate the reasoning chain

Benefits and three practical examples

Chronological reasoning

​Let's think about the following prompt:

Spatial reasoning

Probabilistic reasoning

The Great Simulation

Closing

What needs to be validated in generative AI-based systems?

Open data: transparency and more diverse evidence

What is the environmental impact of AI?

Solutions for sustainable AI

National Green Algorithms Programme

The Future of Sustainable AI

What are licences and why are they important?

Common types of licences

Copyright

Public domain

Creative commons

GNU General Public License (GPL)

European Union Public Licence (EUPL)

Open Data Commons (ODC-BY)

Why is it necessary to use licences in the field of open data?

Introduction to the Joinup Licensing Assistant?

Conclusion

Fundamental principles

General requirements

Technical and functional guidelines

Conclusions

What are data commons?

Principles of the data commons

The important role of the data commons in the age of Artificial Intelligence

Use cases and applications

Challenges in the data commons

The way forward

Conclusion

Let's think about the following prompt: