Blog

When dealing with the liability arising from the use of autonomous systems based on artificial intelligence, it is common to refer to the ethical dilemmas that a traffic accident can pose. This example is useful for illustrating liability for the damage caused by an accident, and even for determining other types of liability in the field of road safety (for example, fines for violations of traffic rules).

Let's imagine that an autonomous vehicle has been driving above the permitted speed, or has simply ignored a traffic signal and caused an accident involving other vehicles. Considering the legal risks, the liability that would arise and, specifically, the impact of data in this scenario, we can ask some questions that help us understand the practical scope of this problem:

  • Have all the necessary datasets of sufficient quality to deal with traffic risks in different environments (rural, urban, dense cities, etc.) been considered in the design and training?

  • What is the responsibility if the accident is due to poor integration of the artificial intelligence tool with the vehicle or a failure of the manufacturer that prevents the correct reading of the signs?

  • Who is responsible if the problem stems from incorrect or outdated information on traffic signs?

In this post we explain what aspects must be considered when assessing the liability that may arise in this type of case.

The impact of data from the perspective of the subjects involved

In the design, training, deployment and use of artificial intelligence systems, the effective control of the data used plays an essential role in the management of legal risks. The conditions of its processing can have important consequences from the perspective of liability in the event of damage or non-compliance with the applicable regulations.

A rigorous approach to this problem requires distinguishing according to each of the subjects involved in the process, from its initial development to its effective use in specific circumstances, since the conditions and consequences can be very different. In this sense, it is necessary to identify the origin of the damage or non-compliance in order to impute the legal consequences to the person who should effectively be considered responsible:

  • Thus, the damage or non-compliance may stem from a problem in the design of the application or in its training, such that certain data is misused for that purpose. Continuing with the example of the autonomous vehicle, this would be the case if the data of the people travelling in it were accessed without consent.

  • However, it is also possible that the problem originates from the person who deploys the tool in each environment for real use, a position that would be occupied by the vehicle manufacturer. This could happen if, for its operation, data is accessed without the appropriate permissions or if there are restrictions that prevent access to the information necessary to guarantee its proper functioning.

  • The problem could also be generated by the person or entity using the tool itself. Returning to the example of the vehicle, this would be the case if the company or individual that owns it has not carried out the necessary periodic inspections or updated the system when required.

  • Finally, there is the possibility that the legal problem of liability is determined by the conditions under which the data are provided at their original source. For example, if the data is inaccurate: the information about the road on which the vehicle is traveling is not up to date or the data emitted by traffic signs is not sufficiently accurate.

Challenges related to the technological environment: complexity and opacity

In addition, the very uniqueness of the technology used may significantly condition the attribution of liability. Specifically, technological opacity, that is, the difficulty in understanding why a system makes a specific decision, is one of the main challenges when addressing the legal issues posed by artificial intelligence, as it makes it difficult to determine the responsible subject. This is a problem that acquires special importance with regard to the lawful origin of the data and, likewise, the conditions under which its processing takes place. In fact, this was precisely the main stumbling block that generative artificial intelligence encountered when it first arrived in Europe: the lack of adequate transparency regarding the processing of personal data justified the temporary halt of its commercialization until the necessary adjustments were made.

In this sense, the publication of the data used for the training phase becomes an additional guarantee from the perspective of legal certainty and, specifically, to verify the regulatory compliance conditions of the tool.

On the other hand, the complexity inherent in this technology poses an additional difficulty in terms of the imputation of the damage that may be caused and, consequently, in the determination of who should pay for it. Continuing with the example of the autonomous vehicle, it could be the case that various causes overlap, such as the inaccuracy of the data provided by traffic signs and, at the same time, a malfunction of the computer application by not detecting potential inconsistencies between the data used and its actual needs.

What does the European Regulation on artificial intelligence say about this?

Regulation (EU) 2024/1689 establishes a harmonised regulatory framework across the European Union in relation to artificial intelligence. With regard to data, it includes some specific obligations for systems classified as "high risk", which are those contemplated in Article 6 and in the list in Annex III (biometric identification, education, labour management, access to essential services, etc.). In this sense, it incorporates a strict regime of technical requirements, transparency, supervision and auditing, combined with conformity assessment procedures prior to placing on the market and post-market control mechanisms, and it also establishes precise responsibilities for providers, deployers and other actors in the value chain.

As regards data governance, a risk management system should be put in place covering the entire lifecycle of the tool and assessing, mitigating, monitoring and documenting risks to health, safety and fundamental rights. Specifically, training, validation, and testing datasets are required to be:

  • Relevant, representative, complete and as error-free as possible for the intended purpose.

  • Managed in accordance with strict governance practices that mitigate bias and discrimination, especially when they may affect the fundamental rights of vulnerable or minority groups.

In addition, the Regulation lays down strict conditions for the exceptional use of special categories of personal data for the purpose of detecting and, where appropriate, correcting bias.

With regard to technical documentation and record keeping, the following are required:

  • The preparation and maintenance of exhaustive technical documentation. In particular, with regard to transparency, complete and clear instructions for use should be provided, including information on data and output results, among other things.

  • Systems should allow for the automatic recording of relevant events (logs) throughout their life cycle to ensure traceability and facilitate post-market surveillance, which can be very useful when checking the impact of the data used.
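Purely as an illustration of what such automatic event recording might look like in practice, here is a minimal sketch in Python; the field names are hypothetical, since the Regulation does not prescribe any concrete format:

```python
# Minimal sketch of automatic event recording ("logs") for traceability.
# Field names are hypothetical; the AI Act does not prescribe a concrete format.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="ai_system_events.log", level=logging.INFO, format="%(message)s")

def log_event(event_type: str, model_version: str, input_ref: str, output_ref: str) -> None:
    """Append one traceability record per relevant system event."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event_type,             # e.g. "inference" or "dataset_update"
        "model_version": model_version,
        "input_ref": input_ref,          # a reference to the data used, not the data itself
        "output_ref": output_ref,
    }
    logging.info(json.dumps(record))

log_event("inference", "v1.3.0", "camera_frame_000042", "decision_brake")
```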

As regards liability, that regulation is based on an approach that is admittedly limited from two points of view:

  • Firstly, it merely empowers Member States to establish a sanctioning regime providing for fines and other means of enforcement, such as warnings and non-pecuniary measures, which must be effective, proportionate and dissuasive. These are, therefore, instruments of an administrative and punitive nature, that is, punishments for non-compliance with the obligations established in the Regulation, among which are those relating to data governance and to the documentation and record-keeping referred to above.

  • However, secondly, the European regulator has not considered it appropriate to establish specific provisions on civil liability aimed at compensating the damage caused. This is an issue of great relevance, which even led the European Commission to formulate a proposal for a specific Directive in 2022. Although its legislative passage has not been completed, it has given rise to an interesting debate whose main arguments have been systematised in a comprehensive report by the European Parliament analysing the impact that such regulation could have.

No clear answers: open debate and regulatory developments

Thus, despite the progress made with the approval of the 2024 Regulation, the truth is that the regulation of liability arising from the use of artificial intelligence tools remains an open question on which there is no complete and developed regulatory framework. However, now that the idea of granting robots legal personality, which arose a few years ago, has been set aside, it is unquestionable that artificial intelligence cannot itself be considered a legally responsible subject.

As emphasized above, this is a complex debate in which it is not possible to offer simple and general answers, since it is essential to specify them in each specific case, taking into account the subjects that have intervened in each of the phases of design, implementation and use of the corresponding tool. It will therefore be these subjects who will have to assume the corresponding responsibility, either for the compensation of the damage caused or, where appropriate, to face the sanctions and other administrative measures in the event of non-compliance with the regulation.

In short, although the 2024 European regulation on artificial intelligence may be useful for establishing standards that help determine when damage caused is contrary to law and must therefore be compensated, the truth is that this is an open debate that will have to be resolved by applying the general rules on consumer protection or defective products, taking into account the singularities of this technology. As far as administrative liability is concerned, it will be necessary to wait for the initiative announced a few months ago, which is pending formal approval by the Council of Ministers before its parliamentary passage in the Spanish Parliament.

Content prepared by Julián Valero, Professor at the University of Murcia and Coordinator of the Research Group "Innovation, Law and Technology" (iDerTec). The contents and points of view reflected in this publication are the sole responsibility of its author.

Blog

Open data from public sources has evolved over the years, from being simple repositories of information to constituting dynamic ecosystems that can transform public governance. In this context, artificial intelligence (AI) emerges as a catalytic technology that benefits from the value of open data and exponentially enhances its usefulness. In this post we will see what the mutually beneficial symbiotic relationship between AI and open data looks like.

Traditionally, the debate on open data has focused on portals: the platforms on which governments publish information so that citizens, companies and organizations can access it. But the so-called "Third Wave of Open Data," a term by New York University's GovLab, emphasizes that it is no longer enough to publish datasets on demand or by default. The important thing is to think about the entire ecosystem: the life cycle of data, its exploitation, maintenance and, above all, the value it generates in society.

What role can open data play in AI?

In this context, AI appears as a catalyst capable of automating tasks, enriching open government data (OGD), facilitating its understanding and stimulating collaboration between actors.

Recent research, developed by European universities, maps how this silent revolution is happening. The study proposes a classification of uses according to two dimensions:

  1. Perspective, which in turn is divided into two possible paths:

    1. Inward-looking (portal): The focus is on the internal functions of data portals.

    2. Outward-looking (ecosystem): the focus is extended to interactions with external actors (citizens, companies, organizations).

  2. Phases of the data life cycle, which can be divided into pre-processing, exploration, transformation and maintenance.

In summary, the report identifies these eight types of AI use in government open data, based on perspective and phase in the data lifecycle.

Figure: a table mapping AI roles onto the Open Government Data (OGD) lifecycle (pre-processing, exploration, transformation, maintenance) from two perspectives. Portal perspective: Portal Curator, Portal Explorer, Portal Linker, Portal Monitor. Ecosystem perspective: Ecosystem Data Retriever, Ecosystem Connector, Ecosystem Value Developer, Ecosystem Engager.

Figure 1. Eight uses of AI to improve government open data. Source: presentation “Data for AI or AI for data: artificial intelligence as a catalyst for open government ecosystems”, based on the report of the same name, from EU Open Data Days 2025.

Each of these uses is described in more detail below:

1. Portal curator

This application focuses on pre-processing data within the portal. AI helps organize, clean, anonymize, and tag datasets before publication. Some examples of tasks are:

  • Automation and improvement of data publication tasks.

  • Performing auto-tagging and categorization functions.

  • Data anonymization to protect privacy.

  • Automatic cleaning and filtering of datasets.

  • Feature extraction and missing data handling.
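As a purely illustrative sketch of two of the tasks above (automatic cleaning and auto-tagging), the snippet below de-duplicates a small, invented catalogue and assigns tags from a keyword vocabulary; a real portal would use richer models, but the idea is the same:

```python
# Sketch of two "portal curator" tasks: de-duplication and keyword-based auto-tagging.
# The catalogue entries and the tag vocabulary are invented for illustration.
import pandas as pd

catalog = pd.DataFrame({
    "title": ["Air quality measurements 2024",
              "Air quality measurements 2024",   # duplicate entry
              "Public bus stops",
              "Municipal budget execution"],
    "description": ["Hourly NO2 and PM10 readings per station",
                    "Hourly NO2 and PM10 readings per station",
                    "Geolocated stops of the urban bus network",
                    "Quarterly budget execution by programme"],
})

# 1. Automatic cleaning: drop exact duplicates before publication.
catalog = catalog.drop_duplicates().reset_index(drop=True)

# 2. Auto-tagging: assign a tag when any of its keywords appears in the description.
vocabulary = {"environment": ["no2", "pm10", "air"],
              "transport": ["bus", "stop"],
              "economy": ["budget"]}

def tag(text: str) -> list:
    text = text.lower()
    return [t for t, keywords in vocabulary.items() if any(k in text for k in keywords)]

catalog["tags"] = catalog["description"].map(tag)
print(catalog[["title", "tags"]])
```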

2. Ecosystem data retriever

Also in the pre-processing phase, but with an external focus, AI expands the coverage of portals by identifying and collecting information from diverse sources. Some tasks are:

  • Retrieve structured data from legal or regulatory texts.

  • News mining to enrich datasets with contextual information.

  • Integration of urban data from sensors or digital records.

  • Discovery and linking of heterogeneous sources.

  • Conversion of complex documents into structured information.
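A very small sketch of the first task, retrieving structured data from a legal text with plain regular expressions; a production system would rely on NLP models, and the sample text is invented:

```python
# Sketch of retrieving structured data from a legal text with regular expressions.
# The sample text is invented; real pipelines would combine this with NLP models.
import re

legal_text = """Article 12 of Regulation (EU) 2024/1689, adopted on 13 June 2024,
requires record-keeping. Article 10 sets out data governance obligations."""

articles = re.findall(r"Article\s+(\d+)", legal_text)
dates = re.findall(r"\d{1,2}\s+[A-Z][a-z]+\s+\d{4}", legal_text)

structured = {"articles_cited": articles, "dates_mentioned": dates}
print(structured)  # {'articles_cited': ['12', '10'], 'dates_mentioned': ['13 June 2024']}
```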

3. Portal explorer

In the exploration phase, AI systems can also make it easier to find and interact with published data, with a more internal approach. Some use cases:

  • Develop semantic search engines to locate datasets.

  • Implement chatbots that guide users in data exploration.
  • Provide natural language interfaces for direct queries.

  • Optimize the portal's internal search engines.

  • Use language models to improve information retrieval.
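As a minimal sketch of a search feature of this kind, the snippet below ranks invented dataset descriptions against a free-text query with TF-IDF and cosine similarity; production portals would typically use embedding models or LLM-backed chatbots instead:

```python
# Sketch of a lightweight dataset search: rank catalogue descriptions against a query.
# Descriptions are invented; TF-IDF is a simple stand-in for semantic search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "Hourly air quality readings per monitoring station",
    "Geolocated stops of the urban bus network",
    "Quarterly municipal budget execution by programme",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(descriptions)

query = "air quality measurements"
scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()

best = scores.argmax()
print(f"Best match: {descriptions[best]} (score={scores[best]:.2f})")
```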

4. Ecosystem connector

Operating also in the exploration phase, AI acts as a bridge between actors and ecosystem resources. Some examples are:

  • Recommend relevant datasets to researchers or companies.

  • Identify potential partners based on common interests.

  • Extract emerging themes to support policymaking.

  • Visualize data from multiple sources in interactive dashboards.

  • Personalize data suggestions based on social media activity.

5. Portal linker

This functionality focuses on the transformation of data within the portal. Its function is to facilitate the combination and presentation of information for different audiences. Some tasks are:

  • Convert data into knowledge graphs (structures that connect related information, known as Linked Open Data).

  • Summarize and simplify data with NLP (Natural Language Processing) techniques.

  • Apply automatic reasoning to generate derived information.

  • Enhance multivariate visualization of complex datasets.

  • Integrate diverse data into accessible information products.
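A minimal sketch of this kind of conversion with the rdflib library, turning one tabular record into RDF triples; the namespace and property names are illustrative, not an official vocabulary mapping:

```python
# Sketch of converting one tabular record into Linked Open Data (RDF) with rdflib.
# The namespace and property names are illustrative, not an official vocabulary.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("http://example.org/air-quality/")
g = Graph()

station = EX["station-042"]
g.add((station, RDF.type, EX.MonitoringStation))
g.add((station, DCTERMS.title, Literal("Station 042 - city centre")))
g.add((station, EX.no2Level, Literal(41.5, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```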

6. Ecosystem value developer

In the transformation phase and with an external perspective, AI generates products and services based on open data that provide added value. Some tasks are:

  • Suggest appropriate analytical techniques based on the type of dataset.

  • Assist in the coding and processing of information.

  • Create dashboards based on predictive analytics.

  • Ensure the correctness and consistency of the transformed data.

  • Support the development of innovative digital services.

7. Portal monitor

It focuses on portal maintenance, with an internal approach. Its role is to ensure quality, consistency, and compliance with standards. Some tasks are:

  • Detect anomalies and outliers in published datasets.

  • Evaluate the consistency of metadata and schemas.

  • Automate data updating and purification processes.

  • Identify incidents in real time for correction.

  • Reduce maintenance costs through intelligent monitoring.
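As a small sketch of anomaly detection on a published dataset, the snippet below flags a suspicious value in a synthetic series with scikit-learn's IsolationForest:

```python
# Sketch of "portal monitor" anomaly detection on a synthetic series of NO2 readings.
import pandas as pd
from sklearn.ensemble import IsolationForest

readings = pd.DataFrame({"no2": [38, 41, 44, 39, 42, 40, 410, 37, 43, 41]})  # 410 looks like a data entry error

model = IsolationForest(contamination=0.1, random_state=0)
readings["anomaly"] = model.fit_predict(readings[["no2"]])  # -1 marks an anomaly

print(readings[readings["anomaly"] == -1])
```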

8. Ecosystem engager

Finally, this function operates in the maintenance phase, but with an outward focus. It seeks to promote citizen participation and continuous interaction. Some tasks are:

  • Predict usage patterns and anticipate user needs.

  • Provide personalized feedback on datasets.

  • Facilitate citizen auditing of data quality.

  • Encourage participation in open data communities.

  • Identify user profiles to design more inclusive experiences.

What does the evidence tell us?

The study is based on a review of more than 70 academic papers examining the intersection between AI and OGD (open government data). From these cases, the authors observe that:

  • Some of the defined profiles, such as portal curator, portal explorer and portal monitor, are relatively mature and have multiple examples in the literature.

  • Others, such as ecosystem value developer and ecosystem engager, are less explored, although they have the most potential to generate social and economic impact.

  • Most applications today focus on automating specific tasks, but there is a lot of scope to design more comprehensive architectures, combining several types of AI in the same portal or across the entire data lifecycle.

From an academic point of view, this typology provides a common language and conceptual structure to study the relationship between AI and open data. It allows identifying gaps in research and guiding future work towards a more systemic approach.

In practice, the framework is useful for:

  • Data portal managers: helps them identify what types of AI they can implement according to their needs, from improving the quality of datasets to facilitating interaction with users.

  • Policymakers: guides them on how to design AI adoption strategies in open data initiatives, balancing efficiency, transparency, and participation.

  • Researchers and developers: it offers them a map of opportunities to create innovative tools that address specific ecosystem needs.

Limitations and next steps of the synergy between AI and open data

In addition to the advantages, the study recognizes some pending issues that, in a way, serve as a roadmap for the future. To begin with, several of the applications that have been identified are still in early stages or are conceptual. And, perhaps most relevantly, the debate on the risks and ethical dilemmas of the use of AI in open data has not yet been addressed in depth: bias, privacy, technological sustainability.

In short, the combination of AI and open data is still a field under construction, but with enormous potential. The key will be to move from isolated experiments to comprehensive strategies capable of generating social, economic and democratic value. AI, in this sense, does not work independently of open data: it multiplies its value and makes it more relevant for governments, citizens and society in general.

Event

For the first time in the history of the organization, Spain will host the Global Summit of the Open Government Partnership (OGP), an international institution of reference in open government and citizen participation. From 6 to 10 October 2025, Vitoria-Gasteiz will become the world capital of open government, welcoming more than 2,000 representatives of governments, civil society organisations and public policy experts from all over the world.

Although registration for the Summit is now closed due to high demand, citizens will be able to follow some of the plenary sessions through online broadcasts and participate in the debates through social networks. In addition, the results and commitments arising from the Summit will be available on the OGP and Government of Spain digital platforms.

In this post, we review the objective, program of activities and more information of interest.

Program of activities of a global event

The OGP Global Summit 2025 will take place at the Europa Conference Centre in Vitoria-Gasteiz, where an ambitious agenda will be developed aligned with the Co-Presidency Programme of the Government of Spain and the Philippine organisation Bankay Kita, Cielo Magno. This agenda is structured around three fundamental thematic axes:

  • People: Activities that address the protection of civic space, the strengthening of democracy, and balancing the contributions of government, civil society, and the private sector. This axis seeks to ensure that all social actors have a voice in democratic processes.

  • Institutions: This block will address the participation of all branches of government to improve transparency, accountability, and citizen participation at all levels of government.

  • Technology and data: It will explore digital rights, social media governance, and internet freedom, as well as promoting digital civic space and freedom of expression in the digital age.

The OGP Summit's programming includes high-level plenary sessions, specialized workshops, side events, and networking spaces that will facilitate knowledge sharing and alliance building. You can check the full program here; among the highlights are:

  • Artificial intelligence and open government: the participatory governance of AI and how to ensure that technological development respects democratic principles and human rights will be discussed.

  • Algorithmic transparency: the need to make algorithmic systems used in public decision-making visible and understandable will be discussed.

  • Open Justice: It will explore how to strengthen the rule of law through more transparent and accessible judicial systems for citizens.

  • Inclusive participation: experiences will be shared on how to ensure that populations in vulnerable situations can effectively participate in democratic processes.

  • Open public procurement: best practices will be presented to make public spending more transparent and efficient through open procurement processes.

Among the most relevant sessions for the open data ecosystem, the one organized by Red.es, "When AI meets open data", stands out; it will be held on the 8th at 9 a.m. Through a round table, it will show how artificial intelligence and open data enhance each other. On the one hand, AI helps to get more out of open data, and on the other hand, this data is essential for training and improving AI systems.

In addition, at the same time, on Thursday 9, the presentation "From data to impact through public-private partnerships and sharing ecosystems" will be held, organized by the General Directorate of Data of the Ministry for Digital Transformation and Public Function. This session will address how public-private sector collaborations can maximize the value of data to make a real impact on society, exploring innovative models of data sharing that respect privacy and foster innovation.

A legacy of democratic transformation

The Vitoria-Gasteiz Summit adds to the tradition of the eight previous summits held in Canada, Georgia, Estonia, France, Korea, Mexico, the United Kingdom and Brazil. Each of these summits has contributed to strengthening the global open government movement, generating concrete commitments that have transformed the relationship between governments and citizens.

In this edition, the most promising and impactful reforms will be recognized through the Open Gov Awards, celebrating innovation and progress in open government globally. These awards highlight initiatives that have demonstrated a real impact on the lives of citizens and that can serve as an inspiration for other countries and territories.

Multi-stakeholder engagement and collaboration

A distinctive feature of OGP is its multi-stakeholder approach, which ensures that both governments and civil society organizations have a say in defining open government agendas. This Summit will be no exception, and will be attended by representatives of citizen organizations, academics, business representatives and activists working for a more participatory and transparent democracy.

At the same time, other events will be held that will complement the official agenda. These activities will address specific topics such as the protection of whistleblowers, youth participation or the integration of the gender perspective in public policies.

This year, the OGP Global Summit 2025 in Vitoria-Gasteiz aims to generate concrete commitments that strengthen democracy in the digital age. In line with the Open Government Partnership's approach, the participating countries are expected to make new commitments in their national action plans, especially in areas such as the governance of artificial intelligence, the protection of digital civic space and the fight against disinformation.

In summary, the OGP 2025 Global Summit in Vitoria-Gasteiz marks a pivotal moment for the future of democracy. In a context of growing challenges for democratic institutions, this meeting reaffirms the importance of maintaining open, transparent and participatory governments as fundamental pillars of free and prosperous societies.

Blog

The idea of conceiving artificial intelligence (AI) as a service for immediate consumption or utility, under the premise that it is enough to "buy an application and start using it", is gaining more and more ground. However, getting on board with AI isn't like buying conventional software and getting it up and running instantly. Unlike other information technologies, AI will hardly be able to be used with the philosophy of plug and play. There is a set of essential tasks that users of these systems should undertake, not only for security and legal compliance reasons, but above all to obtain efficient and reliable results.

The Artificial Intelligence Regulation (RIA)[1]

The RIA defines frameworks that should be taken into account by providers[2] and those responsible for deploying[3] AI. It is a very complex rule whose orientation is twofold. Firstly, in what we could call a high-level approach, the regulation establishes a set of red lines that can never be crossed. The European Union approaches AI from a human-centred perspective, at the service of people. Therefore, any development must first and foremost ensure that fundamental rights are not violated and that no harm is caused to the safety and integrity of people. In addition, no AI that could generate systemic risks to democracy and the rule of law will be admitted. For these objectives to materialize, the RIA deploys a set of processes through a product-oriented approach. This makes it possible to classify AI systems according to their level of risk (low, medium, high), as well as general-purpose AI models[4], and to establish, on the basis of this categorization, the obligations that each participating subject must comply with in order to guarantee the objectives of the standard.

Given the extraordinary complexity of the European regulation, we would like to share in this article some common principles that can be deduced from reading it and could inspire good practices on the part of public and private organisations. Our approach is not so much on defining a roadmap for a given information system as on highlighting some elements that we believe can be useful in ensuring that the deployment and use of this technology are safe and efficient, regardless of the level of risk of each AI-based information system.

Define a clear purpose

The deployment of an AI system depends heavily on the purpose pursued by the organization. It is not a matter of jumping on a bandwagon. It is true that the available public information seems to show that the integration of this type of technology is an important part of the digital transformation processes of companies and public administrations, providing greater efficiency and capabilities. However, installing just any large language model (LLM) cannot become a fad. Prior reflection is needed that takes into account the needs of the organization and defines what type of AI will contribute to improving our capabilities. Not adopting this strategy could put our organization at risk, not only from the point of view of its operation and results, but also from a legal perspective. For example, introducing an LLM or chatbot into a high-risk decision-making environment could result in reputational impacts or liability. Inserting an LLM into a medical environment, or using a chatbot in a sensitive context with an unprepared population or in critical care processes, could end up generating risk situations with unforeseeable consequences for people.

Do no evil

The principle of non-maleficence is a key element and should decisively inspire our practice in the world of AI. For this reason, the RIA establishes a series of expressly prohibited practices to protect the fundamental rights and security of people. These prohibitions focus on preventing manipulation, discrimination, and misuse of AI systems that can cause significant harm.

Categories of Prohibited Practices

1. Manipulation and control of behavior. Through the use of subliminal or manipulative techniques that alter the behavior of individuals or groups, preventing informed decision-making and causing considerable damage.

2. Exploiting vulnerabilities. Derived from age, disability or social/economic situation to substantially modify behavior and cause harm.

3. Social Scoring. AI that evaluates people based on their social behavior or personal characteristics, generating ratings with effects for citizens that result in unjustified or disproportionate treatment.

4. Criminal risk assessment based on profiles. AI used to predict the likelihood of committing crimes solely through profiling or personal characteristics. Its use for criminal investigation is, however, admitted when the crime has actually been committed and there are facts to be analyzed.

5. Facial recognition and biometric databases. Systems for the expansion of facial recognition databases through the non-selective extraction of facial images from the Internet or closed circuit television.

6. Inference of emotions in sensitive environments. Designing or using AI to infer emotions at work or in schools, except for medical or safety reasons.

7. Sensitive biometric categorization. Develop or use AI that classifies individuals based on biometric data to infer race, political opinions, religion, sexual orientation, etc.

8. Remote biometric identification in public spaces. Use of "real-time" remote biometric identification systems in public spaces for police purposes, with very limited exceptions (search for victims, prevention of serious threats, location of suspects of serious crimes).

Apart from the expressly prohibited conduct, it is important to bear in mind that the principle of non-maleficence implies that we cannot use an AI system with the clear intention of causing harm, with the awareness that this could happen or, in any case, when the purpose we pursue is contrary to law.

Ensure proper data governance

The concept of data governance is found in Article 10 of the RIA and applies to high-risk systems. However, it contains a set of principles that are highly valuable when deploying a system of any risk level. High-risk AI systems that use data must be developed with training, validation and testing datasets that meet quality criteria. To this end, certain governance practices are defined to ensure the following (a minimal illustrative check is sketched after this list):

  • Proper design.
  • That the collection and origin of the data, and in the case of personal data the purpose pursued, are adequate and legitimate.
  • Preparation processes such as annotation, labeling, debugging, updating, enrichment, and aggregation are adopted.
  • That the system is designed with use cases whose information is consistent with what the data is supposed to measure and represent.
  • Ensure data quality by ensuring the availability, quantity, and adequacy of the necessary datasets.
  • Detect and review biases that may affect the health and safety of people or their rights, or generate discrimination, especially when data outputs influence the input information of future operations. Measures should be taken to prevent and correct these biases.
  • Identify and resolve gaps or deficiencies in the data that impede compliance with the RIA and, we would add, with the rest of the applicable legislation.
  • The datasets used should be relevant, representative, complete and with statistical properties appropriate for their intended use, and should take into account the geographical, contextual or functional characteristics necessary for the system, as well as ensure its diversity. In addition, they should be as error-free and complete as possible in view of their intended purpose.
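The following minimal sketch illustrates the spirit of these practices with basic checks on a training dataset: completeness, duplicates and the representation of a sensitive group as a crude bias signal. The file and column names are hypothetical; this is an illustration, not a compliance tool:

```python
# Minimal sketch of pre-deployment data governance checks on a training dataset.
# The file name and column names are hypothetical; this is not a compliance tool.
import pandas as pd

df = pd.read_csv("training_data.csv")

report = {
    "rows": len(df),
    "missing_ratio_per_column": df.isna().mean().round(3).to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
}

# Crude representation check on a hypothetical sensitive attribute.
if "age_group" in df.columns:
    report["age_group_distribution"] = df["age_group"].value_counts(normalize=True).round(3).to_dict()

print(report)
```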

AI is a technology that is highly dependent on the data that powers it. From this point of view, not having data governance can not only affect the operation of these tools, but could also generate liability for the user.

In the not too distant future, the obligation for high-risk systems to obtain a CE marking issued by a notified body (i.e., one designated by a Member State of the European Union) will provide conditions of reliability to the market. For the rest of the lower-risk systems, the transparency obligation applies. This does not at all imply that the design of such AI should not take these principles into account as far as possible. Therefore, before contracting a system, it would be reasonable to verify the available pre-contractual information, both in relation to the characteristics of the system and its reliability, and with respect to the conditions and recommendations for deployment and use.

Another issue concerns our own organization. If we do not have the appropriate regulatory, organizational, technical and quality compliance measures to ensure the reliability of our own data, we will hardly be able to use AI tools that feed on it. In the context of the RIA, the user of a system may also incur liability. It is perfectly possible that a product of this nature has been properly developed by the supplier and that, in terms of reproducibility, the supplier can guarantee that under the right conditions the system works properly. What developers and vendors cannot fix are the inconsistencies in the datasets that the user-client feeds into the platform. It is not their responsibility if the customer failed to properly deploy a General Data Protection Regulation compliance framework or is using the system for an unlawful purpose. Nor will they be responsible if the client maintains outdated or unreliable datasets that, when fed into the tool, generate risks or contribute to inappropriate or discriminatory decision-making.

Consequently, the recommendation is clear: before implementing an AI-based system, we must ensure that data governance and compliance with current legislation are adequately guaranteed.

Ensuring Safety

AI is a particularly sensitive technology that presents specific security risks, such as the corruption of data sets. There is no need to look for far-fetched examples. Like any information system, AI requires organizations to deploy and use it securely. Consequently, deploying AI in any environment requires a prior risk analysis that identifies the organizational and technical measures that guarantee safe use of the tool.

Train your staff

Unlike the GDPR, in which this issue is implicit, the RIA expressly establishes the duty to train as an obligation. Article 4 of the RIA is so precise that it is worthwhile to reproduce it in its entirety:

Providers and those responsible for deploying AI systems shall take measures to ensure that, to the greatest extent possible, their staff and others responsible on their behalf for the operation and use of AI systems have a sufficient level of AI literacy, taking into account their technical knowledge, experience, education and training, as well as the intended context of use of the AI systems and the individuals or groups of people on whom those systems are to be used.

This is certainly a critical factor. People who use artificial intelligence must have received adequate training that allows them to understand the nature of the system and to make informed decisions. One of the core principles of the European legislation and approach is human oversight. Therefore, regardless of the guarantees offered by a given market product, the organization that uses it will always be responsible for the consequences. This will be true both where the final decision is attributed to a person and where, in highly automated processes, those responsible for their management are unable to identify an incident and take the appropriate decisions under human supervision.

Culpa in vigilando

The massive introduction of LLMs poses the risk of incurring so-called culpa in vigilando: a legal principle that refers to the responsibility a person assumes for not having exercised due vigilance over another, when that lack of control results in damage or harm. If your organization has introduced any of these marketplace products that integrate functions such as reporting, evaluating alphanumeric information, and even assisting with email management, it will be critical to ensure compliance with the recommendations outlined above. It is particularly advisable to define very precisely the purposes for which the tool is implemented and the roles and responsibilities of each user, to document decisions, and to train staff appropriately.

Unfortunately, the model of introduction of LLMs  into the market has itself generated a systemic and serious risk for organizations. Most tools have opted for a marketing strategy that is no different from the one used by social networks in their day. That is, they allow open and free access to anyone. It is obvious that with this they achieve two results: reuse the information provided to them by monetizing the product and generate a culture of use that facilitates the adoption and commercialization of the tool.

Let's imagine a hypothesis that is, of course, far-fetched. A medical resident (MIR) has discovered that several of these tools have been developed and are in fact used in another country for differential diagnosis. Our resident is very worried about having to wake up the head of medical duty at the hospital every 15 minutes. So, diligently, he contracts a tool that has not been intended for that use in Spain and makes decisions based on the differential diagnosis proposed by an LLM, without yet having the skills that would enable him to provide human oversight. Obviously, there is a significant risk of ending up causing harm to a patient.

Situations such as the one described force us to consider how organizations that do not use AI, but are aware of the risk that their employees may use it without their knowledge or consent, should act. In this regard, a preventive strategy should be adopted, based on issuing very precise circulars and instructions prohibiting such use. There is also a hybrid risk situation: the LLM has been contracted by the organization but is used by an employee for purposes other than those intended. In this case, the combination of security and training acquires strategic value.

Training and the acquisition of a culture of artificial intelligence are probably an essential requirement for society as a whole. Otherwise, the systemic problems and risks that affected the deployment of the Internet in the past will happen again, and perhaps with an intensity that is difficult to govern.

Content prepared by Ricard Martínez, Director of the Chair of Privacy and Digital Transformation. Professor, Department of Constitutional Law, Universitat de València. The contents and points of view reflected in this publication are the sole responsibility of its author.

NOTES:

[1] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828, available at https://eur-lex.europa.eu/legal-content/ES/TXT/?uri=OJ%3AL_202401689

[2] The RIA defines 'provider' as a natural or legal person, public authority, body or agency that develops an AI system or a general-purpose AI model, or for which an AI system or a general-purpose AI model is developed, and that places it on the market or puts the AI system into service under its own name or brand, whether for a fee or free of charge.

[3] The RIA defines 'deployer' (the person responsible for deployment) as a natural or legal person, or public authority, body, office or agency that uses an AI system under its own authority, except where its use is part of a personal activity of a non-professional nature.

[4] The RIA defines a 'general-purpose AI model' as an AI model, including one trained on a large volume of data using large-scale self-supervision, that displays a considerable degree of generality, is capable of competently performing a wide variety of distinct tasks regardless of how the model is placed on the market, and can be integrated into a variety of downstream systems or applications, except for AI models that are used for research, development or prototyping activities before they are placed on the market.

Blog

We know that the open data managed by the public sector in the exercise of its functions is an invaluable resource for promoting transparency, driving innovation and stimulating economic development. At the global level, over the last 15 years this idea has led to the creation of data portals that serve as a single point of access to public information, whether for a country, a region or a city.

However, we sometimes find that the full exploitation of the potential of open data is limited by problems inherent in its quality. Inconsistencies, lack of standardization or interoperability, and incomplete metadata are just some of the common challenges that undermine the usefulness of open datasets and that government agencies also point to as a major obstacle to AI adoption.

When we talk about the relationship between open data and artificial intelligence, we almost always start from the same idea: open data feeds AI, that is, it is part of the fuel for models. Whether it's to train foundational models like ALIA, to specialize small language models (SLMs) versus LLMs, or to evaluate and validate their capabilities or explain their behavior (XAI), the argument revolves around the usefulness of open data for artificial intelligence, forgetting that open data was already there and has many other uses.

Therefore, we are going to reverse the perspective and explore how AI itself can become a powerful tool to improve the quality and, therefore, the value of open data itself. This approach, which was already outlined by the United Nations Economic Commission for Europe (UNECE) in its pioneering 2022 report Machine Learning for Official Statistics, has become more relevant since the explosion of generative AI. We can now use the artificial intelligence available to increase the quality of the datasets that are published throughout their entire lifecycle: from capture and normalization to validation, anonymization, documentation, and monitoring in production.

With this, we can increase the public value of data, contribute to its reuse and amplify its social and economic impact. And, at the same time, to improve the quality of the next generation of artificial intelligence models.

Common challenges in open data quality

Data quality has traditionally been a critical factor in the success of any open data initiative, as noted in numerous reports such as the European Commission's "Improving data publishing by open data portal managers and owners". The most frequent challenges faced by data publishers include:

  • Inconsistencies and errors: Duplicate data, heterogeneous formats, or outliers are common in datasets. Correcting these small errors, ideally at the data source itself, was traditionally costly and greatly limited the usefulness of many datasets.

  • Lack of standardization and interoperability: Two datasets covering the same topic may name columns differently, use non-comparable classifications, or lack persistent identifiers to link entities. Without a common minimum, combining sources becomes artisanal work that makes reusing the data more expensive.

  • Incomplete or inaccurate metadata: The lack of clear information about the origin, collection methodology, frequency of updating or meaning of the fields, complicates the understanding and use of the data. For example, knowing with certainty if the resource can be integrated into a service, if it is up to date or if there is a point of contact to resolve doubts is very important for its reuse.
  • Outdated or stale data: In highly dynamic domains such as mobility, pricing, or environmental data, an outdated set can lead to erroneous conclusions. And if there are no versions, changelogs, or freshness indicators, it's hard to know what's changed and why. The absence of a "history" of the data complicates auditing and reduces trust.
  • Inherent biases: sometimes coverage is incomplete, certain populations are underrepresented, or a management practice introduces systematic deviation. If these limits are not documented and warned, analyses can reinforce inequalities or reach unfair conclusions without anyone noticing.

Where Artificial Intelligence Can Help

Fortunately, in its current state, artificial intelligence is already in a position to provide a set of tools that can help address some of these open data quality challenges, transforming their management from a manual, error-prone process into a more automated and efficient one:

  • Automated error detection and correction: Machine learning algorithms and AI models can automatically and reliably identify inconsistencies, duplicates, outliers, and typos in large volumes of data. In addition, AI can help normalize and standardize data, transforming it, for example, into common formats and schemas that facilitate interoperability (such as DCAT-AP), and at a fraction of the previous cost.
  • Metadata enrichment and cataloging: Technologies associated with natural language processing (NLP), including the use of large language models (LLMs) and small language models (SLMs), can help analyze descriptions and generate more complete and accurate metadata. This includes tasks such as suggesting relevant tags, classification categories, or extracting key entities (place names, organizations, etc.) from textual descriptions to enrich metadata.
  • Anonymization and privacy: When open data contains information that could affect privacy, anonymization becomes a critical, but sometimes costly, task. Artificial Intelligence can contribute to making anonymization much more robust and to minimize risks related to re-identification by combining different data sets.

  • Bias assessment: AI can analyze open datasets for representation or historical biases. This allows publishers to take steps to correct them or, at least, to warn users about their presence so that they are taken into account when the data is reused.

In short, artificial intelligence should not only be seen as a "consumer" of open data, but also as a strategic ally for improving its quality. When integrated with standards, processes and human oversight, AI helps detect and explain incidents, document datasets better, and publish quality evidence that builds trust. As described in the 2024 Artificial Intelligence Strategy, this synergy unlocks more public value: it facilitates innovation, enables better-informed decisions, and consolidates a more robust and reliable open data ecosystem, with more useful data and greater social impact.
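To make the anonymization task mentioned above a little more concrete, here is a toy sketch that pseudonymizes a direct identifier with a salted hash and generalizes birth dates to the year; the column names and the salt are invented, and real anonymization requires a proper re-identification risk analysis:

```python
# Toy sketch of pseudonymization before publication. Column names and salt are invented;
# real anonymization requires a proper re-identification risk analysis.
import hashlib
import pandas as pd

df = pd.DataFrame({
    "national_id": ["12345678A", "87654321B"],
    "birth_date": ["1985-03-14", "1992-11-02"],
    "municipality": ["Valencia", "Bilbao"],
})

SALT = "replace-with-a-secret-salt"

df["person_key"] = df["national_id"].map(
    lambda value: hashlib.sha256((SALT + value).encode()).hexdigest()[:16]
)
df["birth_year"] = pd.to_datetime(df["birth_date"]).dt.year
df = df.drop(columns=["national_id", "birth_date"])

print(df)
```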

In addition, a virtuous cycle is activated: higher quality open data trains more useful and secure models; and more capable models make it easier to continue raising the quality of data. In this way, data management is no longer a static task of publication and becomes a dynamic process of continuous improvement.

Content created by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalisation. The content and views expressed in this publication are the sole responsibility of the author.

Blog

Artificial intelligence (AI) has become a central technology in people's lives and in the strategy of companies. In just over a decade, we've gone from interacting with virtual assistants that understood simple commands, to seeing systems capable of writing entire reports, creating hyper-realistic images, or even writing code.

This visible leap has made many wonder: is it all the same? What is the difference between what we already knew as AI and this new "Generative AI" that is so much talked about?

In this article we are going to organize those ideas and explain, with clear examples, how "Traditional" AI and Generative AI fit under the great umbrella of artificial intelligence.

Traditional AI: analysis and prediction

For many years, what we understood by AI was closer to what we now call "Traditional AI". These systems are characterized by solving concrete, well-defined problems within a framework of available rules or data.

Some practical examples:

  • Recommendation engines: Spotify suggests songs based on your listening history and Netflix adjusts its catalog to your personal tastes, generating up to 80% of views on the platform.

  • Prediction systems: Walmart uses predictive models to anticipate the demand for products based on factors such as weather or local events; Red Eléctrica de España applies similar algorithms to forecast electricity consumption and balance the grid.

  • Automatic recognition: Google Photos classifies images by recognizing faces and objects; Visa and Mastercard use anomaly detection models to identify fraud in real time; Tools like Otter.ai automatically transcribe meetings and calls.

In all these cases, the models learn from past data to provide a classification, prediction, or decision. They do not invent anything new, but recognize patterns and apply them to the future.

Generative AI: content creation

The novelty of generative AI is that it not only analyzes, but also produces (generates) new content from the data it has learned from.

In practice, this means that:

  • You can generate structured text from a couple of initial ideas.

  • You can combine existing visual elements from a written description.

  • You can create product prototypes, draft presentations, or propose code snippets based on learned patterns.

The key is that generative models don't just classify or predict, they generate new combinations based on what they learned during their training.

The impact of this breakthrough is enormous: in the development world, GitHub Copilot already includes agents that detect and fix programming errors on their own; in design, Google's Nano Banana tool promises to revolutionize image editing with an efficiency that could render programs like Photoshop obsolete; and in music, entirely AI-created bands like Velvet Sundown already exceed one million monthly listeners on Spotify, with songs, images and biography fully generated, without real musicians behind them.

When is it best to use each type of AI?

The choice between Traditional and Generative AI is not a matter of fashion, but of what specific need you want to solve. Each shines in different situations:

Traditional AI: the best option when...

  • You need to predict future behaviors based on historical data (sales, energy consumption, predictive maintenance).

  • You want to detect anomalies or classify information accurately (transaction fraud, imaging, spam).

  • You are looking to optimize processes to gain efficiency (logistics, transport routes, inventory management).

  • You work in critical environments where reliability and accuracy are a must (health, energy, finance).

Use it when the goal is to make decisions based on real data with the highest possible accuracy.

Generative AI: the best option when...

  • You need to create content (texts, images, music, videos, code).

  • You want to prototype or experiment quickly, exploring different scenarios before deciding (product design, R&D testing).

  • You are looking for more natural interaction with users (chatbots, virtual assistants, conversational interfaces).

  • You require large-scale personalization, generating messages or materials adapted to each individual (marketing, training, education).

  • You are interested in simulating scenarios that you cannot easily obtain with real data (fictitious clinical cases, synthetic data to train other models).

Use it when the goal is to create, personalize, or interact in a more human and flexible way.

An example from the health field illustrates this well:

  • Traditional AI can analyze thousands of clinical records to anticipate the likelihood of a patient developing a disease.

  • Generative AI can create fictional scenarios to train medical students, generating realistic clinical cases without exposing real patient data.

Do they compete or complement each other?

In 2019, Gartner introduced the concept of Composite AI to describe hybrid solutions that combined different AI approaches to solve a problem more comprehensively. Although it was a term that was not very widespread then, today it is more relevant than ever thanks to the emergence of Generative AI.

Generative AI does not replace Traditional AI, but rather complements it. When you integrate both approaches into a single workflow, you achieve much more powerful results than if you used each technology separately.

Although, according to Gartner, Composite AI is still in the Innovation Trigger phase, in which an emerging technology begins to generate interest and its practical use is still limited, we can already see many new trends emerging in multiple sectors:

  • In retail: A traditional system predicts how many orders a store will receive next week, and generative AI automatically generates personalized product descriptions for customers of those orders.

  • In education: a traditional model assesses student progress and detects weak areas, while generative AI designs exercises or materials tailored to those needs.

  • In industrial design: a traditional algorithm optimizes manufacturing logistics, while a generative AI proposes prototypes of new parts or products.
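As a toy sketch of the retail pattern above, the snippet below combines a traditional forecast (linear regression on synthetic weekly orders) with a generative step that is stubbed out as a template function; in a real system that stub would call a generative model of your choice:

```python
# Toy sketch of the retail "Composite AI" pattern: a traditional model forecasts demand
# and a stubbed generative step produces customer-facing text. Data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

weeks = np.arange(1, 9).reshape(-1, 1)                       # weeks 1..8
orders = np.array([120, 135, 128, 150, 162, 158, 171, 180])  # past weekly orders

forecast = LinearRegression().fit(weeks, orders).predict([[9]])[0]

def generate_description(product: str, expected_orders: float) -> str:
    """Stand-in for a generative model producing personalized product copy."""
    return (f"Stock up on {product}: we expect around {expected_orders:.0f} orders "
            f"next week, so reserve yours today.")

print(generate_description("reusable water bottles", forecast))
```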

Ultimately, instead of questioning which type of AI is more advanced, the right thing to do is to ask: what problem do I want to solve, and which AI approach is right for it?

Content created by Juan Benavente, senior industrial engineer and expert in technologies related to the data economy. The content and views expressed in this publication are the sole responsibility of the author.

Blog

Artificial intelligence (AI) assistants are already part of our daily lives: we ask them the time, how to get to a certain place or we ask them to play our favorite song. And although AI, in the future, may offer us infinite functionalities, we must not forget that linguistic diversity is still a pending issue.

In Spain, where Spanish coexists with co-official languages such as Basque, Catalan, Valencian and Galician, this issue is especially relevant. The survival and vitality of these languages in the digital age depends, to a large extent, on their ability to adapt and be present in emerging technologies. Currently, most virtual assistants, automatic translators or voice recognition systems do not understand all the co-official languages. However, did you know that there are collaborative projects to ensure linguistic diversity?

In this post we tell you about the approach and the greatest advances of some initiatives that are building the digital foundations necessary for the co-official languages in Spain to also thrive in the era of artificial intelligence.

ILENIA, the coordinator of multilingual resource initiatives in Spain

The models that we are going to see in this post share a common approach: they are all part of ILENIA, a state-level coordinator that connects the individual efforts of the autonomous communities. This initiative brings together the projects of BSC-CNS (AINA), CENID (VIVES), HiTZ (NEL-GAITU) and the University of Santiago de Compostela (NÓS), with the aim of generating digital resources that allow the development of multilingual applications in the different languages of Spain.

The success of these initiatives depends fundamentally on citizen participation. Through platforms such as Mozilla's Common Voice, any speaker can contribute to the construction of these linguistic resources through different forms of collaboration:

  • Read speech: collecting different ways of speaking through voice donations based on a specific text.
  • Spontaneous speech: creating real and organic datasets as a result of conversations based on prompts.
  • Text in the language: collaborating in the transcription of audios or in the contribution of textual content, suggesting new phrases or questions to enrich the corpora.

All resources are published under free licenses such as CC0, allowing them to be used free of charge by researchers, developers and companies.

The challenge of linguistic diversity in the digital age

Artificial intelligence systems learn from the data they receive during their training. To develop technologies that work correctly in a specific language, it is essential to have large volumes of data: audio recordings, text corpora and examples of real use of the language.

In other publications on datos.gob.es we have addressed the functioning of foundational models and Spanish-language initiatives such as ALIA, trained with large text corpora such as those of the Royal Spanish Academy.

Both posts explain why language data collection is not a cheap or easy task. Technology companies have invested massively in compiling these resources for languages with large numbers of speakers, but Spanish co-official languages face a structural disadvantage. This has led to many models not working properly or not being available in Valencian, Catalan, Basque or Galician.

However, there are collaborative and open data initiatives that allow the creation of quality language resources. These are the projects that several autonomous communities have launched, marking the way towards a multilingual digital future.

On the one hand, the Nós Project in Galicia creates oral and conversational resources in Galician, covering all accents and dialectal variants, to facilitate their integration into tools such as GPS systems, voice assistants or ChatGPT. A similar purpose is pursued by Aina in Catalonia, which also offers an academic platform and a laboratory for developers, and by Vives in the Valencian Community. In the Basque Country there is also the Euskorpus project, which aims to build a quality text corpus in Basque. Let's look at each of them.

Proyecto Nós, a collaborative approach to digital Galician

The project has already developed three operational tools: a multilingual neural translator, a speech recognition system that converts speech into text, and a speech synthesis application. These resources are published under open licenses, guaranteeing their free and open access for researchers, developers and companies. These are its main features:

  • Promoted by: the Xunta de Galicia and the University of Santiago de Compostela.
  • Main objective: to create oral and conversational resources in Galician that capture the dialectal and accent diversity of the language.
  • How to participate: The project accepts voluntary contributions both by reading texts and by answering spontaneous questions.

Aina, towards an AI that understands and speaks Catalan

With a similar approach to the Nós project, Aina seeks to facilitate the integration of Catalan into artificial intelligence language models.

It is structured in two complementary aspects that maximize its impact:

  • Aina Tech focuses on facilitating technology transfer to the business sector, providing the necessary tools to automatically translate websites, services and online businesses into Catalan.
  • Aina Lab promotes the creation of a community of developers through initiatives such as the Aina Challenge, fostering collaborative innovation in Catalan language technologies. Through this call, 22 proposals have already been selected, with a total amount of 1 million to execute their projects.


Vives, the collaborative project for AI in Valencian

On the other hand, Vives collects voices speaking in Valencian to serve as training material for AI models.

Gaitu: strategic investment in the digitalisation of the Basque language

In Basque, we can highlight Gaitu, which seeks to collect voices speaking in Basque in order to train AI models.

Benefits of Building and Preserving Multilingual Language Models

The digitization projects for the co-official languages transcend the purely technological field to become tools for digital equity and cultural preservation. Their impact is manifested in multiple dimensions:

  • For citizens: these resources ensure that speakers of all ages and levels of digital competence can interact with technology in their mother tongue, removing barriers that could exclude certain groups from the digital ecosystem.
  • For the business sector: the availability of open language resources makes it easier for companies and developers to create products and services in these languages without assuming the high costs traditionally associated with the development of language technologies.
  • For the research community: these corpora constitute a fundamental basis for advancing research in natural language processing and speech technologies, which is especially relevant for languages with less presence in international digital resources.

The success of these initiatives shows that it is possible to build a digital future where linguistic diversity is not an obstacle but a strength, and where technological innovation is put at the service of the preservation and promotion of linguistic cultural heritage.

Blog

Over the last few years we have seen spectacular advances in the use of artificial intelligence (AI) and, behind all these achievements, we will always find the same common ingredient: data. An illustrative example known to everyone is that of the language models used by OpenAI for its famous ChatGPT, such as GPT-3, one of its first models that was trained with more than 45 terabytes of data, conveniently organized and structured to be useful.

Without sufficient availability of quality, properly prepared data, even the most advanced algorithms will not be of much use, either socially or economically. In fact, Gartner estimates that more than 40% of emerging AI agent projects today will end up being abandoned in the medium term due to a lack of adequate data and other quality issues. Therefore, the effort invested in standardizing, cleaning and documenting data can make the difference between a successful AI initiative and a failed experiment. In short, it is the classic computer-engineering principle of "garbage in, garbage out", applied this time to artificial intelligence: if we feed an AI with low-quality data, its results will be equally poor and unreliable.

Awareness of this problem has given rise to the concept of "AI Data Readiness", or the preparation of data to be used by artificial intelligence. In this article, we'll explore what it means for data to be "AI-ready", why it's important, and what we'll need for AI algorithms to be able to leverage our data effectively, which in turn results in greater social value, favoring the elimination of biases and the promotion of equity.

What does it mean for data to be "AI-ready"?

Having AI-ready data means that this data meets a series of technical, structural and quality requirements that optimize its use by artificial intelligence algorithms. This includes multiple aspects, such as the completeness of the data, the absence of errors and inconsistencies, the use of appropriate formats, metadata and homogeneous structures, as well as providing the necessary context to verify that the data is aligned with the use that AI will make of it.
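By way of illustration, part of this assessment can be automated with a few lines of code. The following sketch, which assumes a hypothetical CSV file and column names, shows the kind of basic completeness, consistency and structure checks that a first "AI-readiness" review might include.

```python
# Minimal sketch of basic "AI-readiness" checks on a tabular dataset.
# The file name, columns and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("sensor_readings.csv")  # hypothetical dataset

# Completeness: share of missing values per column.
missing_ratio = df.isna().mean()

# Consistency: duplicated records that could distort training.
duplicates = df.duplicated().sum()

# Homogeneous structure: verify that the expected columns are present.
expected_columns = ["sensor_id", "timestamp", "value", "unit"]
schema_ok = all(col in df.columns for col in expected_columns)

print(missing_ratio)
print("duplicated rows:", duplicates)
print("expected columns present:", schema_ok)
```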

Preparing data for AI often requires a multi-stage process. The consulting firm Gartner, for example, recommends the following steps:

  1. Assess data needs according to the use case: identify which data is relevant to the problem we want to solve with AI (the type of data, volume needed, level of detail, etc.), understanding that this assessment can be an iterative process that is refined as the AI project progresses.
  2. Align business areas and get management support: present data requirements to managers based on identified needs and get their backing, thus securing the resources required to prepare the data properly.
  3. Develop good data governance practices: implement appropriate data management policies and tools (quality, catalogs, data lineage, security, etc.) and ensure that they also incorporate the needs of AI projects.
  4. Expand the data ecosystem: integrate new data sources, break down potential barriers and silos that are working in isolation within the organization and adapt the infrastructure to be able to handle the large volumes and variety of data necessary for the proper functioning of AI.
  5. Ensure scalability and regulatory compliance: ensure that data management can scale as AI projects grow, while maintaining a robust governance framework in line with the necessary ethical protocols and compliance with existing regulations.

If we follow a strategy like this one, we will be able to integrate the new requirements and needs of AI into our usual data governance practices. In essence, it is simply a matter of ensuring that our data is prepared to feed AI models with the minimum possible friction, avoiding setbacks later on during the development of projects.

Open data "ready for AI"

In the field of open science and open data, the FAIR principles have been promoted for years. This acronym states that data must be findable, accessible, interoperable and reusable. The FAIR principles have served to guide the management of scientific and open data to make them more useful and improve their use by the scientific community and society at large. However, these principles were not designed to address the new needs associated with the rise of AI.

Therefore, a proposal is currently being made to extend the original principles by adding a fifth readiness principle for AI, thus moving from the initial FAIR to FAIR-R or FAIR². The aim would be precisely to make explicit those additional attributes that make data ready to accelerate its responsible and transparent use as a necessary tool for AI applications of high public interest.

FAIR-R Principles: Findable, Accessible, Interoperable, Reusable, Readiness. Source: own elaboration - datos.gob.es

What exactly would this new R add to the FAIR principles? In essence, it emphasizes some aspects such as:

  • Labelling, annotation and adequate enrichment of data.
  • Transparency on the origin, lineage and processing of data.
  • Standards, metadata, schemas and formats optimal for use by AI.
  • Sufficient coverage and quality to avoid bias or lack of representativeness.
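One possible way of making these attributes explicit is to publish, alongside the data, a machine-readable datasheet or "data card". The sketch below is purely illustrative (the fields and values are hypothetical and do not follow any particular standard), but it shows how labelling, provenance, schema and coverage information could be declared so that both people and AI pipelines can read it.

```python
# Illustrative "data card" making FAIR-R attributes explicit.
# All field names and values are hypothetical.
import json

data_card = {
    "title": "Urban traffic sensor readings 2024",
    "license": "CC0-1.0",
    "provenance": {
        "source": "city open data portal",
        "period": "2024-01-01/2024-12-31",
        "processing": ["deduplication", "unit normalisation", "anonymisation"],
    },
    "schema": {
        "sensor_id": "string",
        "timestamp": "ISO 8601 datetime",
        "vehicles_per_hour": "integer",
    },
    "annotations": {
        "labels": "congestion level (low / medium / high)",
        "labelling_guidelines": "v1.2",
    },
    "coverage": {
        "districts_covered": 21,
        "known_gaps": "no sensors on rural access roads",
    },
}

print(json.dumps(data_card, indent=2, ensure_ascii=False))
```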

In the context of open data, this discussion is especially relevant to the discourse of the "fourth wave" of the open data movement, which argues that if governments, universities and other institutions release their data but it is not in the optimal conditions to feed the algorithms, a unique opportunity for a whole new universe of innovation and social impact would be missed: improvements in medical diagnostics, detection of epidemiological outbreaks, optimization of urban traffic and transport routes, maximization of crop yields or prevention of deforestation are just a few examples of the opportunities that could be lost.

Otherwise, we could also enter a long "data winter", in which positive AI applications are constrained by poor-quality, inaccessible or biased datasets. In that scenario, the promise of AI for the common good would be frozen, unable to evolve due to a lack of adequate raw material, while AI applications driven by private interests would continue to advance and widen unequal access to the benefits provided by these technologies.

Conclusion: the path to quality, inclusive AI with true social value

We can never take for granted the quality or suitability of data for new AI applications: we must continue to evaluate it, work on it and govern it in a rigorous and effective way, just as has been recommended for other applications. Making our data AI-ready is therefore not a trivial task, but the long-term benefits are clear: more accurate algorithms, reduced unwanted bias, greater transparency in AI, and its benefits extended to more areas in an equitable way.

Conversely, ignoring data preparation carries a high risk of failed AI projects, erroneous conclusions, or exclusion of those who do not have access to quality data. Addressing the unfinished business on how to prepare and share data responsibly is essential to unlocking the full potential of AI-driven innovation for the common good. If quality data is the foundation for the promise of more humane and equitable AI, let's make sure we build a strong enough foundation to be able to reach our goal.

On this path towards a more inclusive artificial intelligence, fuelled by quality data and with real social value, the European Union is also making steady progress. Through initiatives such as its Data Union strategy, the creation of common data spaces in key sectors such as health, mobility or agriculture, and the promotion of the so-called AI Continent and AI factories, Europe seeks to build a digital infrastructure where data is governed responsibly, interoperable and prepared to be used by AI systems for the benefit of the common good. This vision not only promotes greater digital sovereignty but reinforces the principle that public data should be used to develop technologies that serve people and not the other way around.


Content prepared by Carlos Iglesias, Open data Researcher and consultant, World Wide Web Foundation. The contents and views reflected in this publication are the sole responsibility of the author.

Blog

In the usual search for tricks to make our prompts more effective, one of the most popular is activating the chain of thought. It consists of posing a multi-step problem and asking the AI system to solve it, not by giving us the solution all at once, but by making visible, step by step, the logical path needed to solve it. This feature is available in both paid and free AI systems; it's all about knowing how to activate it.

Originally, the chain of reasoning was one of the many semantic logic tests that developers put language models through. However, in 2022, Google Brain researchers demonstrated for the first time that providing examples of chained reasoning in the prompt could unlock greater problem-solving capabilities in models.

From that moment on, it has gradually established itself as a useful technique for obtaining better results, while at the same time being heavily questioned from a technical point of view, because what is really striking about this process is that language models do not actually think in a chain: they are only simulating human reasoning before our eyes.

How to activate the reasoning chain

There are two possible ways to activate this process in the models: from a button provided by the tool itself, as in the case of DeepSeek with the "DeepThink" button that activates the R1 model:

 


Figure 1. DeepSeek with the "DeepThink" button that activates the R1 model.

Or, and this is the simplest and most common option, from the prompt itself. If we opt for this option, we can do it in two ways: only with the instruction (zero-shot prompting) or by providing solved examples (few-shot prompting).

  • Zero-shot prompting: as simple as adding an instruction such as "Reason step by step" or "Think before answering" at the end of the prompt. This ensures that the chain of reasoning is activated and that the logical process used to solve the problem is made visible.


Figure 2. Example of Zero-shot prompting.

  • Few-shot prompting: if we want a very precise response pattern, it may be worth providing some solved question-answer examples. The model sees this demonstration and imitates it as a pattern for the new question (a minimal code sketch of both variants follows after the figure below).


Figure 3. Example of Few-shot prompting.
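The same two techniques can also be reproduced programmatically rather than through a chat interface. The following sketch assumes the OpenAI Python client and an illustrative model name; any chat-capable model and client could be used in the same way, since the prompts themselves are the only part that matters.

```python
# Minimal sketch of zero-shot and few-shot chain-of-thought prompting,
# assuming the OpenAI Python client (the model name is illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot: simply append the instruction to the question.
zero_shot = (
    "If Juan was born in October and is 15 years old, "
    "how old was he in June of last year? Reason step by step."
)

# Few-shot: provide a solved example so the model imitates the pattern.
few_shot = (
    "Q: There are 4 pencils and you give away 1. How many are left?\n"
    "A: We start with 4 pencils. We give away 1, so 4 - 1 = 3. Answer: 3.\n\n"
    "Q: If Juan was born in October and is 15 years old, "
    "how old was he in June of last year?\n"
    "A:"
)

for prompt in (zero_shot, few_shot):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```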

Benefits and three practical examples

When we activate the chain of reasoning, we are asking the system to "show" its work before our eyes, as if it were solving the problem on a blackboard. Although errors are not completely eliminated, forcing the language model to spell out the logical steps reduces the chance of making them, because the model focuses its attention on one step at a time. In addition, in the event of an error, it is much easier for the user of the system to detect it with the naked eye.

When is the chain of reasoning useful? Especially in mathematical calculations, logical problems, puzzles, ethical dilemmas or questions with several stages and jumps (so-called multi-hop questions). It is particularly practical in the latter, above all when the question requires handling knowledge about the world that is not directly included in it.

Let's see some examples in which we apply this technique to a chronological problem, a spatial problem and a probabilistic problem.

  • Chronological reasoning

Let's think about the following prompt:

If Juan was born in October and is 15 years old, how old was he in June of last year?


Figure 5. Example of chronological reasoning.

For this example we have used OpenAI's o3 model, available in the Plus version of ChatGPT and specialized in reasoning, so the chain of thought is activated by default and it is not necessary to trigger it from the prompt. This model is programmed to tell us how long it took to solve the problem, in this case 6 seconds. Both the answer and the explanation are correct, and to arrive at them the model had to incorporate external information such as the order of the months of the year, knowledge of the current date to establish the temporal anchor, and the idea that age changes in the month of the birthday, not at the beginning of the year.

  • Spatial reasoning

  • A person is facing north. They turn 90 degrees to the right, then 180 degrees to the left. In which direction are they facing now?


    Figure 6. Example of spatial reasoning.

    This time we have used the free version of ChatGPT, which uses the GPT-4o model by default (although with limitations), so it is safer to activate the reasoning chain with an indication at the end of the prompt: Reason step by step. To solve this problem, the model needs general knowledge of the world that it has learned in training, such as the spatial orientation of the cardinal points, the degrees of rotation, laterality and the basic logic of movement.

  • Probabilistic reasoning

  • In a bag there are 3 red balls, 2 green balls and 1 blue ball. If you draw a ball at random without looking, what's the probability that it's neither red nor blue?


    Figure 7. Example of probabilistic reasoning.

    To launch this prompt we have used Gemini 2.5 Flash, in Google's Gemini Pro version. The training of this model certainly included the fundamentals of both basic arithmetic and probability, but what is most effective for the model in learning to solve this type of exercise are the millions of solved examples it has seen. Probability problems and their step-by-step solutions are the pattern it imitates when reconstructing this reasoning.
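The expected answer is easy to verify by hand or with a couple of lines of code: "neither red nor blue" leaves only the green balls, so the probability is 2 out of 6.

```python
# Quick check of the expected result for the ball-drawing problem.
red, green, blue = 3, 2, 1
total = red + green + blue
p_neither_red_nor_blue = green / total
print(p_neither_red_nor_blue)  # 2/6 ≈ 0.333, i.e. one third
```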

The Great Simulation

Now let's turn to the criticism. In recent months, the debate about whether we can trust these simulated explanations has grown, especially since, ideally, the chain of thought should faithfully reflect the internal process by which the model arrives at its answer, and there is no practical guarantee that this is the case.

The Anthropic team (creators of Claude, another large language model) carried out a trap experiment with Claude Sonnet in 2025, in which they slipped the model a key hint to the solution before activating the reasoned response.

Think of it like passing a student a note that says "the answer is [A]" before an exam. If the student writes on the exam that they chose [A] at least in part because of the note, that's good news: they are being honest and faithful. But if they write down what claims to be their reasoning process without mentioning the note, we might have a problem.

The percentage of times Claude Sonnet mentioned the hint among its deductions was only 25%. This shows that models sometimes generate explanations that sound convincing but do not correspond to the true internal logic by which they arrived at the solution; they are rationalizations a posteriori: first they find the solution, then they construct a coherent process for the user. This highlights the risk that the model may be hiding steps or information relevant to solving the problem.

Closing

Despite the limitations described above and illustrated by the study just mentioned, we cannot forget that in the original Google Brain research it was documented that, when applying the chain of reasoning, the PaLM model improved its performance on mathematical problems from 17.9% to 58.1% accuracy. If, in addition, we combine this technique with searches over open data to bring in information external to the model, the reasoning becomes more verifiable, up to date and robust.

However, by making language models "think out loud", what we are really improving in every case is the user experience in complex tasks. If we do not fall into excessive delegation of our thinking to AI, our own cognitive process can also benefit. It is likewise a technique that greatly facilitates our new role as supervisors of automatic processes.


Content prepared by Carmen Torrijos, expert in AI applied to language and communication. The contents and points of view reflected in this publication are the sole responsibility of the author.

Blog

Sport has always been characterized by generating a lot of data: statistics, graphs... But accumulating figures is not enough. It is necessary to analyze the data, draw conclusions and make decisions based on them. The advantages of sharing data in this sector go beyond sport itself, having a positive impact on health and the economic sphere.

Artificial intelligence (AI) has also reached the professional sports sector, and its ability to process huge amounts of data has opened the door to making the most of the potential of all that information. Manchester City, one of the best-known football clubs in the British Premier League, was one of the pioneers in using artificial intelligence to improve its sporting performance: it uses AI algorithms for the selection of new talent and has collaborated in the development of WaitTime, an artificial intelligence platform that manages crowd attendance in large sports and leisure venues. In Spain, Real Madrid, for example, incorporated the use of artificial intelligence a few years ago and promotes forums on the impact of AI on sport.

Artificial intelligence systems analyze extensive volumes of data collected during training and competitions, and are able to provide detailed evaluations on the effectiveness of strategies and optimization opportunities. In addition, it is possible to develop alerts on injury risks, allowing prevention measures to be established, or to create personalized training plans that are automatically adapted to each athlete according to their individual needs. These tools have completely changed contemporary high-level sports preparation. In this post we are going to review some of these use cases.

From simple observation to complete data management to optimize results

Traditional methods of sports evaluation have evolved into highly specialized technological systems. Artificial intelligence and machine learning tools process massive volumes of information during training and competitions, converting statistics, biometric data and audiovisual content into strategic insights for managing athletes' preparation and health.

Real-time performance analysis systems are one of the most established implementations in the sports sector. To collect this data, it is common to see athletes training with bands or vests that monitor different parameters in real time. Both these and other devices and sensors record movements, speeds and biometric data; heart rate, speed and acceleration are some of the most common measurements. AI algorithms process this information, generating immediate results that help optimize personalized training programs and tactical adaptations for each athlete, identifying patterns to locate areas for improvement.

In this sense, sports artificial intelligence platforms evaluate both individual performance and, in team sports, collective dynamics. To evaluate the tactical dimension, different types of data are analyzed depending on the sports modality. In endurance disciplines, speed, distance, pace or power are examined, while in team sports data on player positions and the accuracy of passes or shots are especially relevant.

Another advance is AI cameras, which make it possible to follow the trajectory of players on the field and the movements of different elements, such as the ball in ball sports. These systems generate a multitude of data on positions, movements and patterns of play. The analysis of these historical datasets makes it possible to identify strategic strengths and vulnerabilities, both one's own and those of opponents. This helps to generate different tactical options and improve decision-making before a competition.

Health and well-being of athletes

Sports injury prevention systems analyze historical data and metrics in real time. Their algorithms identify injury risk patterns, allowing personalized preventive measures to be taken for each athlete. In the case of football, teams such as Manchester United, Liverpool, Valencia CF and Getafe CF have been implementing these technologies for several years.

In addition to the data we have seen above, sports monitoring platforms also record physiological variables continuously: heart rate, sleep patterns, muscle fatigue and movement biomechanics. Wearable devices with artificial intelligence capabilities detect indicators of fatigue, imbalances or physical stress that precede injuries. With this data, the algorithms identify patterns that reveal risks and make it easier to act preventively, adjusting training or developing specific recovery programs before an injury occurs. In this way, training loads, repetition volume, intensity and recovery periods can be calibrated according to individual profiles. This predictive maintenance for athletes is especially relevant for teams and clubs in which athletes are not only sporting assets, but also economic ones. In addition, these systems also optimise sports rehabilitation processes, reducing recovery times for muscle injuries by up to 30% and providing predictions on the risk of relapse.
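As a purely illustrative sketch (the features, figures and alert threshold are hypothetical and do not correspond to any club's real system), an injury-risk alert of this kind could be prototyped from weekly monitoring data along these lines.

```python
# Illustrative injury-risk flag built from monitoring data.
# Features, data and threshold are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly features per athlete:
# [training load (a.u.), average nightly sleep (h), muscle soreness (0-10)]
X = np.array([
    [300, 8.0, 2], [320, 7.5, 3], [500, 6.0, 7], [480, 5.5, 8],
    [310, 8.2, 2], [520, 6.1, 6], [290, 7.9, 1], [510, 5.8, 9],
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # 1 = injury in the following weeks

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# New week of monitoring data for one athlete.
this_week = np.array([[495, 6.2, 7]])
risk = model.predict_proba(this_week)[0, 1]

if risk > 0.5:  # hypothetical alert threshold
    print(f"High injury risk ({risk:.0%}): adjust load and recovery plan.")
```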

While not foolproof, the data indicates that these platforms predict approximately 50% of injuries over the course of a season, although they cannot predict exactly when they will occur. The application of AI to healthcare in sport thus contributes to extending professional sports careers, supporting optimal performance and the athlete's long-term well-being.

Improving the audience experience

Artificial intelligence is also revolutionizing the way fans enjoy sport, both in stadiums and at home. Thanks to natural language processing (NLP) systems, viewers can follow comments and subtitles in real time, facilitating access for people with hearing impairments or speakers of other languages. Manchester City has recently incorporated this technology for the generation of real-time subtitles on the screens of its stadium. These applications have also reached other sports disciplines: IBM Watson has developed a functionality that allows Wimbledon fans to watch the videos with highlighted commentary and AI-generated subtitles.

In addition, AI optimises the management of large crowds through sensors and predictive algorithms, speeding up access, improving security and customising services such as seat locations. Even in broadcasts, AI-powered tools offer instant statistics, automated highlights, and smart cameras that follow the action without human intervention, making the experience more immersive and dynamic. The NBA uses Second Spectrum, a system that combines cameras with AI to analyze player movements and create visualizations, such as passing routes or shot probabilities. Other sports, such as golf or Formula 1, also use similar tools that enhance the fan experience.

Data privacy and other challenges

The application of AI in sport also poses significant ethical challenges. The collection and analysis of biometric information raises doubts about the security and protection of athletes' personal data, so it is necessary to establish protocols that guarantee the management of consent, as well as the ownership of such data.

Equity is another concern, as the application of artificial intelligence gives competitive advantages to teams and organizations with greater economic resources, which can contribute to perpetuating inequalities.

Despite these challenges, artificial intelligence has radically transformed the professional sports landscape. The future of sport seems to be linked to the evolution of this technology. Its application promises to continue to elevate athlete performance and the public experience, although some challenges need to be overcome.
