Blog

Over the last few years we have seen spectacular advances in the use of artificial intelligence (AI) and, behind all these achievements, we always find the same common ingredient: data. A well-known example is the family of language models behind OpenAI's famous ChatGPT: GPT-3, one of its first models, was trained on more than 45 terabytes of data, conveniently organized and structured to be useful.

Without sufficient availability of quality, properly prepared data, even the most advanced algorithms will not be of much use, either socially or economically. In fact, Gartner estimates that more than 40% of emerging AI agent projects today will end up being abandoned in the medium term due to a lack of adequate data and other quality issues. The effort invested in standardizing, cleaning, and documenting data can therefore make the difference between a successful AI initiative and a failed experiment. In short, it is the classic computer-engineering principle of "garbage in, garbage out", this time applied to artificial intelligence: if we feed an AI with low-quality data, its results will be equally poor and unreliable.

Awareness of this problem has given rise to the concept of "AI Data Readiness", that is, the preparation of data to be used by artificial intelligence. In this article, we'll explore what it means for data to be "AI-ready", why it matters, and what we'll need for AI algorithms to be able to leverage our data effectively. Doing so also generates greater social value, favoring the elimination of biases and the promotion of equity.

What does it mean for data to be "AI-ready"?

Having AI-ready data means that this data meets a series of technical, structural, and quality requirements that optimize its use by artificial intelligence algorithms. This includes multiple aspects, such as the completeness of the data, the absence of errors and inconsistencies, the use of appropriate formats, metadata and homogeneous structures, as well as providing the necessary context to verify that the data is aligned with the use that AI will make of it.

Preparing data for AI often requires a multi-stage process. The consulting firm Gartner, for example, recommends the following steps:

  1. Assess data needs according to the use case: identify which data is relevant to the problem we want to solve with AI (the type of data, volume needed, level of detail, etc.), understanding that this assessment can be an iterative process that is refined as the AI project progresses.
  2. Align business areas and get management support: present data requirements to managers based on identified needs and get their backing, thus securing the resources required to prepare the data properly.
  3. Develop good data governance practices: implement appropriate data management policies and tools (quality, catalogs, data lineage, security, etc.) and ensure that they also incorporate the needs of AI projects.
  4. Expand the data ecosystem: integrate new data sources, break down potential barriers and silos that are working in isolation within the organization and adapt the infrastructure to be able to handle the large volumes and variety of data necessary for the proper functioning of AI.
  5. Ensure scalability and regulatory compliance: ensure that data management can scale as AI projects grow, while maintaining a robust governance framework in line with the necessary ethical protocols and compliance with existing regulations.

If we follow a strategy like this one, we will be able to integrate the new requirements and needs of AI into our usual data governance practices. In essence, it is simply a matter of ensuring that our data is prepared to feed AI models with the minimum possible friction, avoiding setbacks later on during project development.

Open data "ready for AI"

In the field of open science and open data, the FAIR principles have been promoted for years. This acronym states that data must be findable, accessible, interoperable and reusable. The FAIR principles have served to guide the management of scientific and open data to make them more useful and improve their use by the scientific community and society at large. However, these principles were not designed to address the new needs associated with the rise of AI.

Therefore, a proposal is currently being made to extend the original principles by adding a fifth readiness principle for AI, thus moving from the initial FAIR to FAIR-R or FAIR². The aim would be precisely to make explicit those additional attributes that make data ready to accelerate its responsible and transparent use as a necessary tool for AI applications of high public interest.

FAIR-R Principles: Findable, Accessible, Interoperable, Reusable, Readiness. Source: own elaboration - datos.gob.es

What exactly would this new R add to the FAIR principles? In essence, it emphasizes some aspects such as:

  • Labelling, annotation and adequate enrichment of data.
  • Transparency on the origin, lineage and processing of data.
  • Standards, metadata, schemas and formats optimal for use by AI.
  • Sufficient coverage and quality to avoid bias or lack of representativeness.

In the context of open data, this discussion is especially relevant within the discourse of the "fourth wave" of the open data movement, which argues that if governments, universities and other institutions release their data but it is not in optimal condition to feed the algorithms, a unique opportunity for a whole new universe of innovation and social impact would be missed: improvements in medical diagnostics, detection of epidemiological outbreaks, optimization of urban traffic and transport routes, maximization of crop yields or prevention of deforestation are just a few examples of the opportunities that could be lost.

Failing that, we could also enter a long "data winter", where positive AI applications are constrained by poor-quality, inaccessible, or biased datasets. In that scenario, the promise of AI for the common good would be frozen, unable to evolve for lack of adequate raw material, while AI applications led by private interests would continue to advance and widen unequal access to the benefits these technologies provide.

Conclusion: the path to quality, inclusive AI with true social value

We can never take for granted the quality or suitability of data for new AI applications: we must continue to evaluate it, work on it and govern it as rigorously and effectively as has long been recommended for other applications. Making our data AI-ready is therefore not a trivial task, but the long-term benefits are clear: more accurate algorithms, less unwanted bias, greater transparency of AI, and benefits extended to more areas in an equitable way.

Conversely, ignoring data preparation carries a high risk of failed AI projects, erroneous conclusions, or exclusion of those who do not have access to quality data. Addressing the unfinished business on how to prepare and share data responsibly is essential to unlocking the full potential of AI-driven innovation for the common good. If quality data is the foundation for the promise of more humane and equitable AI, let's make sure we build a strong enough foundation to be able to reach our goal.

On this path towards a more inclusive artificial intelligence, fuelled by quality data and with real social value, the European Union is also making steady progress. Through initiatives such as its Data Union strategy, the creation of common data spaces in key sectors such as health, mobility or agriculture, and the promotion of the so-called AI Continent and AI factories, Europe seeks to build a digital infrastructure where data is responsibly governed, interoperable and ready to be used by AI systems for the benefit of the common good. This vision not only promotes greater digital sovereignty but also reinforces the principle that public data should be used to develop technologies that serve people, and not the other way around.


Content prepared by Carlos Iglesias, Open data Researcher and consultant, World Wide Web Foundation. The contents and views reflected in this publication are the sole responsibility of the author.

Blog

Energy is the engine of our society, a vital resource that powers our lives and the global economy. However, the traditional energy model faces monumental challenges: growing demand, climate urgency, and the prevailing need for a transition to cleaner and more sustainable sources. In this panorama of profound transformation, a silent but powerful actor emerges: data. Not only is "having data" important, but also the ability to govern it properly to transform the energy sector.

In this new energy paradigm, data has become a strategic resource as essential as energy itself. The key is not only in generating and distributing electricity, but in understanding, anticipating and optimizing its use in real time. And to do this, it is necessary to capture the digital pulse of the energy system through millions of measurement and observation points.

So, before addressing how this data is governed, it's worth understanding where it comes from, what kind of information it generates, and how it's quietly transforming how the power grid works.

The digital heartbeat of the network: data from smart meters and sensors

Imagine an electric grid that not only distributes power, but also "listens," "learns," and "reacts." This is the promise of smart grids, a system that goes far beyond the cables and transformers we see. A smart grid is an electricity distribution system that uses digital technology to improve grid efficiency, sustainability, and security. At the heart of this revolution are smart meters and a vast network of sensors.

Smart meters, the key component of the Advanced Metering Infrastructure (AMI), are devices that record electricity consumption digitally, often at very short time intervals (e.g. every 15 minutes or per hour), and transmit this data to power companies via various communication technologies, such as cellular networks, WiFi, PLC (Power Line Communication) or radio frequency (RF). This data is not limited to the total amount of energy consumed, but offers a detailed breakdown of consumption patterns, voltage levels, power quality, and even fault detection.
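To make this flow of readings more tangible, the following minimal Python sketch (using pandas) aggregates 15-minute meter readings into hourly and daily consumption profiles and flags abnormal voltage values. The column names, sample values and thresholds are illustrative assumptions, not a real AMI export format.

```python
import pandas as pd

# Illustrative 15-minute smart meter readings; the schema is an assumption,
# not a real AMI export format.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=96, freq="15min"),
    "meter_id": "MTR-001",
    "kwh": 0.25,        # energy consumed in each 15-minute interval
    "voltage": 230.0,   # measured supply voltage
}).set_index("timestamp")

# Aggregate to hourly and daily consumption to reveal usage patterns
hourly = readings["kwh"].resample("1h").sum()
daily = readings["kwh"].resample("1D").sum()

# Simple power-quality check: flag intervals outside an illustrative +/-10% voltage band
anomalies = readings[(readings["voltage"] < 207) | (readings["voltage"] > 253)]

print(daily)
print(f"Voltage anomalies detected: {len(anomalies)}")
```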

But network intelligence doesn't just lie with the meters. A myriad of sensors distributed throughout the electrical infrastructure monitor critical variables in real time: from transformer temperature and equipment status to environmental conditions and power flow at different points on the grid. These sensors act as the "eyes and ears" of the system, providing a granular and dynamic view of network performance.

The magic happens in the flow of this data. The information from the meters and sensors travels bidirectionally: from the point of consumption or generation to the management platforms of the electricity company and vice versa. This constant communication allows utilities to:

  • Measure and bill consumption accurately
  • Implement demand response programs
  • Optimize power distribution
  • Predict and prevent disruptions
  • Efficiently integrate renewable energy sources that are intermittent by their nature

Data governance: the backbone of a connected network

The mere collection of data, however abundant, does not guarantee its value. In fact, without proper management, this heterogeneity of sources can become an insurmountable barrier to the integration and useful analysis of information. This is where data governance comes into play.

Data governance in the context of smart grids involves establishing a robust set of principles, processes, roles, and technologies to ensure that the data generated is reliable, accessible, useful, and secure. It sets the "rules of the game" that define how data is captured, stored, maintained, used, protected, and deleted throughout its entire lifecycle.

Why is this so crucial?

  • Interoperability: A smart grid is not a monolithic system, but a constellation of devices, platforms, and actors (generators, distributors, consumers, prosumers, regulators). For all these elements to "speak the same language", interoperability is essential. Data governance sets standards for nomenclature, formats, encoding, and synchronization, allowing information to flow frictionlessly between disparate systems. Without it, we risk creating fragmented and costly information silos.
  • Quality: Artificial intelligence algorithms and machine learning, so vital to smart grids, are only as good as the data they are fed with. Data governance ensures the accuracy, completeness, and consistency of data (and future information and knowledge) by defining business rules, cleaning up duplicates, and managing data errors. Poor quality data can lead to wrong decisions, operational inefficiencies, and unreliable results.
  • Security: The interconnection of millions of devices in a smart network exponentially expands the attack surface for cybercriminals. A breach of data security could have catastrophic consequences, from massive power outages to breaches of user privacy. Data governance is the shield that implements robust access controls, encryption protocols, and usage audits, safeguarding the integrity and confidentiality of critical information. Adhering to consolidated security frameworks such as ENS, ISO/IEC 27000, NIST, IEC 62443, and NERC CIP is critical.

Ultimately, effective data governance turns data into critical infrastructure, as important as cables and substations, for decision-making, resource optimization, and intelligent automation.

Data in action: optimising, anticipating and facilitating the energy transition

Governing data is not an end in itself, but the means to unlock vast potential for efficiency and sustainability in the energy sector.

1. Optimisation of consumption and operational efficiency

Exact, complete, consistent, current, credible and real-time data enables multiple advantages in energy management:

  • Consumption at the user level: smart meters empower citizens and businesses by providing them with detailed information about their own consumption. This allows them to identify patterns, adjust their habits, and ultimately reduce their energy bills.
  • Demand management: Utilities can use data to implement demand response (DR) programs. These programs incentivize consumers to reduce or shift their electricity consumption during periods of high demand or high prices, thereby balancing the load on the grid and avoiding costly investments in new infrastructure.
  • Reduced inefficiencies: The availability of accurate and well-integrated data allows utilities to automate tasks, avoid redundant processes, and reduce unplanned downtime in their systems. For example, a generation plant can adjust its production in real-time based on the analysis of performance and demand data.
  • Energy monitoring and emission control: real-time monitoring of energy, water or polluting gas emissions reveals hidden inefficiencies and savings opportunities. Smart dashboards, powered by governed data, enable industrial plants and cities to reduce their costs and advance their environmental sustainability goals.

2. Demand anticipation and grid resilience

Smart grids can also foresee the future of energy consumption:

  • Demand forecasting: using advanced artificial intelligence and machine learning algorithms (such as time series analysis or neural networks), historical consumption data combined with external factors such as weather, holidays, or special events allows utilities to forecast demand with remarkable accuracy (a minimal sketch follows this list). This anticipation is vital to optimize resource allocation, avoid overloads, and ensure network stability.
  • Predictive maintenance: By combining historical maintenance data with real-time information from sensors on critical equipment, companies can anticipate machine failures before they occur, proactively schedule maintenance, and avoid costly unexpected outages.
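The sketch below illustrates the demand-forecasting idea from the first bullet with synthetic data: a gradient boosting model trained on lagged demand, calendar features and a weather-like variable. It is a toy example under those assumptions, not a production forecasting pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic hourly demand driven by temperature, hour of day and noise (illustrative only)
rng = np.random.default_rng(42)
idx = pd.date_range("2023-01-01", periods=24 * 365, freq="h")
temperature = 15 + 10 * np.sin(2 * np.pi * idx.dayofyear / 365) + rng.normal(0, 2, len(idx))
demand = (500 + 20 * np.abs(temperature - 18)
          + 50 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 10, len(idx)))

df = pd.DataFrame({"demand": demand, "temperature": temperature,
                   "hour": idx.hour, "dayofweek": idx.dayofweek}, index=idx)
df["demand_lag_24h"] = df["demand"].shift(24)   # same hour on the previous day
df = df.dropna()

# Chronological split: train on the first ~11 months, test on the last 30 days
train, test = df[:-24 * 30], df[-24 * 30:]
features = ["temperature", "hour", "dayofweek", "demand_lag_24h"]

model = GradientBoostingRegressor().fit(train[features], train["demand"])
pred = model.predict(test[features])
print(f"MAE on the held-out month: {mean_absolute_error(test['demand'], pred):.1f} demand units")
```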

3. Facilitating the energy transition

Data governance is an indispensable catalyst for the integration of renewable energy and decarbonization:

  • Integration of renewables: sources such as solar and wind energy are intermittent by nature. Real-time data on generation, weather conditions, and grid status are critical to managing this variability, balancing the load, and maximizing the injection of clean energy into the grid.
  • Distributed Energy Resources (DER) management: The proliferation of rooftop solar panels, storage batteries, and electric vehicles (which can charge and discharge energy to the grid) requires sophisticated data management. Data governance ensures the interoperability needed to coordinate these resources efficiently, transforming them into "virtual power plants" that can support grid stability.
  • Boosting the circular economy: thanks to the full traceability of a product's life cycle, from design to recycling, data makes it possible to identify opportunities for reuse, recovery of materials and sustainable design. This is crucial to comply with new circular economy regulations and the Digital Product Passport (DPP).
  • Digital twins: For a virtual replica of a physical process or system to work, it needs to be powered by accurate and consistent data. Data governance ensures synchronization between the physical and virtual worlds, enabling reliable simulations to optimize the design of new production lines or the arrangement of elements in a factory.

Tangible benefits for citizens, businesses and administrations

Investment in data governance in smart grids generates significant value for all actors in society:

For citizens

  • Savings on electricity bills: by having access to real-time consumption data and flexible tariffs (for example, with lower prices in off-peak hours), citizens can adjust their habits and reduce their energy costs.
  • Empowerment and control: citizens go from being mere consumers to "prosumers", with the ability to generate their own energy (for example, with solar panels) and even inject the surplus into the grid, being compensated for it. This encourages participation and greater control over their energy consumption.
  • Better quality of life: A more resilient and efficient grid means fewer power interruptions and greater reliability, which translates into a better quality of life and uninterrupted essential services.
  • Promoting sustainability: By participating in demand response programs and adopting more efficient consumption behaviors, citizens contribute directly to the reduction of the country's carbon footprint and energy transition.

For companies

  • Optimization of operations and cost reduction: companies can predict demand, adjust production and perform predictive maintenance of their machinery, reducing losses due to failures and optimizing the use of energy and material resources.
  • New business models: The availability of data creates opportunities for the development of new services and products. This includes platforms for energy exchange, intelligent energy management systems for buildings and homes, or the optimization of charging infrastructures for electric vehicles.
  • Loss reduction: Intelligent data management allows utilities to minimize losses in power transmission and distribution, prevent overloads, and isolate faults faster and more efficiently.
  • Improved traceability: in regulated sectors such as food, automotive or pharmaceuticals, the complete traceability of the product from the raw material to the end customer is not only an added value, but a regulatory obligation. Data governance ensures that this traceability is verifiable and meets standards.
  • Regulatory compliance: Robust data management enables companies to comply with increasingly stringent regulations on sustainability, energy efficiency, and emissions, as well as data privacy regulations (such as GDPR).

For Public Administrations

  • Smart energy policymaking: Aggregated and anonymised data from the smart grid provides public administrations with valuable information to design more effective energy policies, set ambitious decarbonisation targets and strategically plan the country's energy future.
  • Infrastructure planning: with a clear view of consumption patterns and future needs, governments can more efficiently plan grid upgrades and expansions, as well as the integration of distributed energy resources such as smart microgrids.
  • Boosting urban resilience: the ability to manage and coordinate locally distributed energy resources, such as in micro-grids, improves the resilience of cities to extreme events or failures in the main grid.
  • Promotion of technological and data sovereignty: by encouraging the publication of this data in open data portals together with the creation of national and sectoral data spaces, Administrations ensure that the value generated by data stays in the country and in local companies, boosting innovation and competitiveness at an international level.

Challenges and best practices in smart grid data governance

Despite the immense benefits, implementing effective data governance initiatives in the energy sector presents significant challenges:

  • Heterogeneity and complexity of data integration: Data comes from a multitude of disparate sources (meters, sensors, SCADA, ERP, MES, maintenance systems, etc.). Integrating and harmonizing this information is a considerable technical and organizational challenge.
  • Privacy and compliance: Energy consumption data can reveal highly sensitive patterns of behavior. Ensuring user privacy and complying with regulations such as the GDPR is a constant challenge that requires strong ethical and legal frameworks.
  • Cybersecurity: The massive interconnection of devices and systems expands the attack surface, making smart grids attractive targets for sophisticated cyberattacks. Integrating legacy systems with new technologies can also create vulnerabilities.
  • Data quality: Without robust processes, information can be inconsistent, incomplete, or inaccurate, leading to erroneous decisions.
  • Lack of universal standards: The absence of uniform cybersecurity practices and regulations across different regions can reduce the effectiveness of security measures.
  • Resistance to change and lack of data culture: The implementation of new data governance policies and processes can encounter internal resistance, and a lack of understanding about the importance of data often hampers efforts.
  • Role and resource allocation: Clearly defining who is responsible for which aspect of the data and securing adequate financial and human resources is critical to success.
  • Scalability: As the volume and variety of data grows exponentially, the governance structure must be able to scale efficiently to avoid bottlenecks and compliance issues.

To overcome these challenges, the adoption of the following best practices is essential:

  • Establish a strong governance framework: define clear principles, policies, processes and roles from the outset, with the support of public administrations and senior management. This can be addressed by implementing the processes of UNE 0077 to 0080, which include the definition of data governance, data management and data quality processes, as well as the definition of organizational structures.
  • Ensure data quality: Implement data quality assessment methodologies and processes, such as data asset classification and cataloguing, quality control (validation, duplicate cleanup), and data lifecycle management. All this can be based on the implementation of a quality model following UNE 0081.
  • Prioritize cybersecurity and privacy: Implement robust security frameworks (ENS, ISO 27000, NIST, IEC 62443, NERC CIP), secure IoT devices, use advanced threat detection tools (including AI), and build resilient systems with network segmentation and redundancy. Ensure compliance with data privacy regulations (such as GDPR).
  • Promote interoperability through standards: adopt open standards for communication and data exchange between systems, such as OPC UA or ISA-95.
  • Invest in technology and automation: Use data governance tools that enable automatic data discovery and classification, application of data protection rules, automation of metadata management, and cataloguing of data. Automating routine tasks improves efficiency and reduces errors.
  • Collaboration and information sharing: Encourage the exchange of threat and best-practice information among utilities, government agencies, and other industry stakeholders. In this regard, it is worth highlighting the more than 900 datasets on energy published in the datos.gob.es catalogue; likewise, the creation of "data spaces" (such as the National Data Space for Energy or Industry in Spain) facilitates the secure and efficient sharing of data between organisations, boosting innovation and sectoral competitiveness.
  • Continuous monitoring and improvement: data governance is a continuous process. KPIs should be established to monitor progress, evaluate performance, and make improvements based on feedback and regulatory or strategic changes.

Conclusions: a connected and sustainable future

Energy and data are inextricably linked in the future. Smart grids are the manifestation of this symbiosis, and data governance is the key to unlocking their potential. By transforming data from simple records into strategic assets and critical infrastructure, we can move towards a more efficient, sustainable and resilient energy model.

Collaboration between companies, citizens and administrations, driven by initiatives such as the National Industry Data Space in Spain, is essential to build this future. This space not only seeks to improve industrial efficiency, but also to reinforce the country's technological and data sovereignty, ensuring that the value generated by data benefits our own companies, regions and sectors. By investing in strong data governance initiatives and building shared data ecosystems, we are investing in an industry that is more connected, smarter and ready for tomorrow's energy and climate challenges.


Content prepared by Dr. Fernando Gualo, Professor at UCLM and Data Governance and Quality Consultant. The content and the point of view reflected in this publication are the sole responsibility of its author.

Blog

Imagine a machine that can tell if you're happy, worried, or about to make a decision, even before you are clearly aware of it yourself. Although it sounds like science fiction, that future is already starting to take shape. Thanks to advances in neuroscience and technology, today we can record, analyze, and even predict certain patterns of brain activity. The data generated from these records is known as neurodata.

In this article we will explain this concept, as well as potential use cases, based on the report "TechDispatch on Neurodata", of the Spanish Data Protection Agency (AEPD).

What is neurodata and how is it collected?

The term neurodata refers to data that is collected directly from the brain and nervous system, using technologies such as electroencephalography (EEG), functional magnetic resonance imaging (fMRI), neural implants, or even brain-computer interfaces. In this sense, its uptake is driven by neurotechnologies.

According to the OECD, neurotechnologies are "devices and procedures that are used to access, investigate, evaluate, manipulate, and emulate the structure and function of neural systems." Neurotechnologies can be invasive (if they require brain-computer interfaces that are surgically implanted in the brain) or non-invasive, with interfaces that are placed outside the body (such as glasses or headbands).

There are also two common ways to collect data:

  • Passive collection, where data is captured on a regular basis without the subject having to perform any specific activity.
  • Active collection, where data is collected while users perform a specific activity. For example, thinking explicitly about something, answering questions, performing physical tasks, or receiving certain stimuli.

Potential use cases

Once the raw data has been collected, it is stored and processed. The treatment will vary according to the purpose and the intended use of the neurodata.

Figure 1. Common structure to understand the processing of neurodata in different use cases. Source: "TechDispatch Report on Neurodata", by the Spanish Data Protection Agency (AEPD).

As can be seen in the image above, the Spanish Data Protection Agency has identified three possible purposes:

  1. Neurodata processing to acquire direct knowledge and/or make predictions.

Neurodata can uncover patterns that decode brain activity in a variety of industries, including:

  • Health: Neurodata facilitates research into the functioning of the brain and nervous system, making it possible to detect signs of neurological or mental diseases, make early diagnoses, and predict their behavior. This facilitates personalized treatment from very early stages. Its impact can be remarkable, for example, in the fight against Alzheimer's, epilepsy or depression.

  • Education: through brain stimuli, students' performance and learning outcomes can be analyzed. For example, students' attention or cognitive effort can be measured. By cross-referencing this data with other internal (such as student preferences) and external (such as classroom conditions or teaching methodology) aspects, decisions can be made aimed at adapting the pace of teaching.

  • Marketing, economics and leisure: the brain's response to certain stimuli can be analysed to improve leisure products or advertising campaigns. The objective is to know the motivations and preferences that impact decision-making. They can also be used in the workplace, to track employees, learn about their skills, or determine how they perform under pressure.

  • Safety and surveillance: Neurodata can be used to monitor factors that affect drivers or pilots, such as drowsiness or inattention, and thus prevent accidents.
  2. Neurodata processing to control applications or devices.

As in the previous stage, it involves the collection and analysis of information for decision-making, but it also involves an additional operation: the generation of actions through mental impulses. Let's look at several examples:

  • Orthopaedic or prosthetic aids, medical implants or environment-assisted living: thanks to technologies such as brain-computer interfaces, it is possible to design prostheses that respond to the user's intention through brain activity. In addition, neurodata can be integrated with smart home systems to anticipate needs, adjust the environment to the user's emotional or cognitive states, and even issue alerts for early signs of neurological deterioration. This can lead to an improvement in patients' autonomy and quality of life.

  • Robotics:  The user's neural signals can be interpreted to control machinery, precision devices or applications without the need to use hands. This allows, for example, a person to operate a robotic arm or a surgical tool simply with their thinking, which is especially valuable in environments where extreme precision is required or when the operator has reduced mobility.
  • Video games, virtual reality and metaverse: since neurodata allows software devices to be controlled, brain-computer interfaces can be developed that make it possible to manage characters or perform actions within a game, only with the mind, without the need for physical controls. This not only increases player immersion, but opens the door to more inclusive and personalized experiences.
  • Defense: Soldiers can operate weapons systems, unmanned vehicles, drones, or explosive ordnance disposal robots remotely, increasing personal safety and operational efficiency in critical situations.
  3. Neurodata treatment for the stimulation or modulation of the subject, achieving neurofeedback.

In this case, signals from the brain (outputs) are used to generate new signals that feed back into the brain (such as inputs), which involves the control of brain waves. It is the most complex field from an ethical point of view, since actions could be generated that the user is not aware of. Some examples are:

  • Psychology: Neurodata has the potential to change the way the brain responds to certain stimuli. It can therefore be used as a therapy method to treat ADHD (Attention Deficit Hyperactivity Disorder), anxiety, depression, epilepsy, autism spectrum disorder, insomnia or drug addiction, among others.

  • Neuroenhancement: They can also be used to improve cognitive and affective abilities in healthy people. Through the analysis and personalized stimulation of brain activity, it is possible to optimize functions such as memory, concentration, decision-making or emotional management.

Ethical challenges of the use of neurodata

As we have seen, although the potential of neurodata is enormous, it also poses great ethical and legal challenges. Unlike other types of data, neurodata can reveal deeply intimate aspects of a person, such as their desires, emotions, fears, or intentions. This opens the door to potential misuses, such as manipulation, covert surveillance, or discrimination based on neural features. In addition, they can be collected remotely and acted upon without the subject being aware of the manipulation.

This has generated a debate about the need for new rights, such as neurorights, which seek to protect mental privacy, personal identity and cognitive freedom. Various international organizations, including the European Union, are taking measures to address these challenges and advance the creation of regulatory and ethical frameworks that protect fundamental rights in the use of neurotechnologies. We will soon publish an article that delves into these aspects.

In conclusion, neurodata is a very promising advance, but not without challenges. Its ability to transform sectors such as health, education or robotics is undeniable, but so are the ethical and legal challenges posed by its use. As we move towards a future where mind and machine are increasingly connected, it is crucial to establish regulatory frameworks that ensure the protection of human rights, especially mental privacy and individual autonomy. In this way, we will be able to harness the full potential of neurodata in a fair, safe and responsible way, for the benefit of society as a whole.

Documentación

Data sharing has become a critical pillar for the advancement of analytics and knowledge exchange, both in the private and public sectors. Organizations of all sizes and industries—companies, public administrations, research institutions, developer communities, and individuals—find strong value in the ability to share information securely, reliably, and efficiently.

This exchange goes beyond raw data or structured datasets. It also includes more advanced data products such as trained machine learning models, analytical dashboards, scientific experiment results, and other complex artifacts that have significant impact through reuse. In this context, the governance of these resources becomes essential. It is not enough to simply move files from one location to another; it is necessary to guarantee key aspects such as access control (who can read or modify a given resource), traceability and auditing (who accessed it, when, and for what purpose), and compliance with regulations or standards, especially in enterprise and governmental environments.

To address these requirements, Unity Catalog emerges as a next-generation metastore, designed to centralize and simplify the governance of data and data-related resources. Originally part of the services offered by the Databricks platform, the project has now transitioned into the open source community, becoming a reference standard. This means that it can now be freely used, modified, and extended, enabling collaborative development. As a result, more organizations are expected to adopt its cataloging and sharing model, promoting data reuse and the creation of analytical workflows and technological innovation.

 

Unity Catalog Overview

Figure 1. Unity Catalog overview. Source: https://docs.unitycatalog.io/

Access the data lab repository on Github.

Run the data preprocessing code on Google Colab

Objectives

In this exercise, we will learn how to configure Unity Catalog, a tool that helps us organize and share data securely in the cloud. Although we will use some code, each step will be explained clearly so that even those with limited programming experience can follow along through a hands-on lab.

We will work with a realistic scenario in which we manage public transportation data from different cities. We’ll create data catalogs, configure a database, and learn how to interact with the information using tools like Docker, Apache Spark, and MLflow.

Difficulty level: Intermediate.

Figure 2. Unity Catalog schematic

Required Resources

In this section, we’ll explain the prerequisites and resources needed to complete this lab. The lab is designed to be run on a standard personal computer (Windows, macOS, or Linux).

We will be using the following tools and environments:

  • Docker Desktop: Docker allows us to run applications in isolated environments called containers. A container is like a "box" that includes everything needed for the application to run properly, regardless of the operating system.
  • Visual Studio Code: Our main working environment will be a Python Notebook, which we will run and edit using the widely adopted code editor Visual Studio Code (VS Code).
  • Unity Catalog: Unity Catalog is a data governance tool that allows us to organize and control access to resources such as tables, data volumes, functions, and machine learning models. In this lab, we will use its open source version, which can be deployed locally, to learn how to manage data catalogs with permission control, traceability, and hierarchical structure. Unity Catalog acts as a centralized metastore, making data collaboration and reuse more secure and efficient.
  • Amazon Web Services (AWS): AWS will serve as our cloud provider to host some of the lab’s data—specifically, raw data files (such as JSON) that we will manage using data volumes. We’ll use the Amazon S3 service to store these files and configure the necessary credentials and permissions so that Unity Catalog can interact with them in a controlled manner.

Key Learnings from the Lab

Throughout this hands-on exercise, participants will deploy the application, understand its architecture, and progressively build a data catalog while applying best practices in organization, access control, and data traceability.

Deployment and First Steps

  • We clone the Unity Catalog repository and launch it using Docker.

  • We explore its architecture: a backend accessible via API and CLI, and an intuitive graphical user interface.

  • We navigate the core resources managed by Unity Catalog: catalogs, schemas, tables, volumes, functions, and models.


What Will We Learn Here?

How to launch the application, understand its core components, and start interacting with it through different interfaces: the web UI, API, and CLI.
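As a rough illustration of the API route, the Python sketch below lists and creates catalogs through the server's REST interface. The host, port, base path, response shape and catalog name are assumptions for a default local deployment and should be checked against the Unity Catalog documentation.

```python
import requests

# Assumed default local deployment of the open source Unity Catalog server;
# adjust host, port and base path to match your own setup.
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"

# List the catalogs registered in the metastore
# (the response shape with a "catalogs" key is assumed from the REST API docs)
response = requests.get(f"{BASE_URL}/catalogs")
for catalog in response.json().get("catalogs", []):
    print(catalog["name"], "-", catalog.get("comment", ""))

# Create a catalog for one of the cities in the lab scenario (illustrative name)
payload = {"name": "madrid_transport", "comment": "Public transport data for Madrid"}
created = requests.post(f"{BASE_URL}/catalogs", json=payload)
print(created.status_code, created.json())
```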

Resource Organization

  • We configure an external MySQL database as the metadata repository.

  • We create catalogs to represent different cities and schemas for various public services.


What Will We Learn Here?

How to structure data governance at different levels (city, service, dataset) and manage metadata in a centralized and persistent way.

Data Construction and Real-World Usage

  • We create structured tables to represent routes, buses, and bus stops.

  • We load real data into these tables using PySpark.

  • We set up an AWS S3 bucket as raw data storage (volumes).

  • We upload JSON telemetry event files and govern them from Unity Catalog.


 
   

What Will We Learn Here?

How to work with different types of data (structured and unstructured), and how to integrate them with external sources like AWS S3.
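The PySpark sketch below summarises this step under a few assumptions: a SparkSession already configured to talk to the local Unity Catalog metastore (connector settings omitted), and illustrative catalog, schema, table and file names based on the lab's city/service scenario.

```python
from pyspark.sql import SparkSession

# Assumes the Spark <-> Unity Catalog connector is already configured
# (see the lab repository for the exact session settings).
spark = SparkSession.builder.appName("transport-lab").getOrCreate()

# Three-level namespace: catalog.schema.table (names are illustrative)
spark.sql("CREATE SCHEMA IF NOT EXISTS madrid_transport.buses")
spark.sql("""
    CREATE TABLE IF NOT EXISTS madrid_transport.buses.routes (
        route_id STRING,
        origin STRING,
        destination STRING,
        avg_daily_passengers INT
    )
""")

# Load a local CSV with route data (hypothetical path) and append it to the governed table
routes = spark.read.option("header", True).csv("data/routes.csv")
routes.write.mode("append").saveAsTable("madrid_transport.buses.routes")

spark.sql("SELECT route_id, avg_daily_passengers FROM madrid_transport.buses.routes").show(5)
```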

Reusable Functions and AI Models

  • We register custom functions (e.g., distance calculation) directly in the catalog.

  • We create and register machine learning models using MLflow.

  • We run predictions from Unity Catalog just like any other governed resource.


 
   

What Will We Learn Here?

How to extend data governance to functions and models, and how to enable their reuse and traceability in collaborative environments.
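A minimal MLflow sketch of this idea follows. The trip-duration model, its three-level name and the commented-out registry URI are illustrative assumptions: the exact configuration needed for MLflow to register models in Unity Catalog depends on your deployment, so check both projects' documentation.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: the MLflow model registry points at the local Unity Catalog server.
# The exact URI depends on your deployment, e.g.:
# mlflow.set_registry_uri("uc:http://localhost:8080/api/2.1/unity-catalog")

# Toy model: predict bus trip duration (minutes) from distance (km) -- illustrative only
X = np.array([[1.0], [2.5], [4.0], [6.0], [8.5]])
y = np.array([6.0, 12.0, 18.0, 27.0, 37.0])

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    info = mlflow.sklearn.log_model(model, artifact_path="trip_duration_model")

# Register the model under a three-level name (catalog.schema.model) so it is
# governed like any other catalog resource; the name is illustrative.
mlflow.register_model(info.model_uri, "madrid_transport.buses.trip_duration_model")
```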

Results and Conclusions

As a result of this hands-on lab, we gained practical experience with Unity Catalog as an open platform for the governance of data and data-related resources, including machine learning models. We explored its capabilities, deployment model, and usage through a realistic use case and a tool ecosystem similar to what you might find in an actual organization.

Through this exercise, we configured and used Unity Catalog to organize public transportation data. Specifically, we have learned how to:

  • Install tools like Docker and Spark.
  • Create catalogs, schemas, and tables in Unity Catalog.
  • Load data and store it in an Amazon S3 bucket.
  • Implement a machine learning model using MLflow.

In the coming years, we will see whether tools like Unity Catalog achieve the level of standardization needed to transform how data resources are managed and shared across industries.

We encourage you to keep exploring data science! Access the full repository here

 


Content prepared by Juan Benavente, senior industrial engineer and expert in technologies linked to the data economy. The contents and points of view reflected in this publication are the sole responsibility of the author.

Blog

Generative artificial intelligence is beginning to find its way into everyday applications ranging from virtual agents (or teams of virtual agents) that resolve queries when we call a customer service centre to intelligent assistants that automatically draft meeting summaries or report proposals in office environments.

These applications, often powered by foundational large language models (LLMs), promise to revolutionise entire industries on the basis of huge productivity gains. However, their adoption brings new challenges because, unlike traditional software, a generative AI model does not follow fixed rules written by humans; its responses are based on statistical patterns learned from processing large volumes of data. This makes its behaviour less predictable and more difficult to explain, and sometimes leads to unexpected results, errors that are difficult to foresee, or responses that do not always align with the original intentions of the system's creator.

Therefore, the validation of these applications from multiple perspectives such as ethics, security or consistency is essential to ensure confidence in the results of the systems we are creating in this new stage of digital transformation.

What needs to be validated in generative AI-based systems?

Validating generative AI-based systems means rigorously checking that they meet certain quality and accountability criteria before relying on them to solve sensitive tasks.

It is not only about verifying that they ‘work’, but also about making sure that they behave as expected, avoiding biases, protecting users, maintaining their stability over time, and complying with applicable ethical and legal standards. The need for comprehensive validation is a growing consensus among experts, researchers, regulators and industry: deploying AI reliably requires explicit standards, assessments and controls.

We summarize four key dimensions that need to be checked in generative AI-based systems to align their results with human expectations:

  • Ethics and fairness: a model must respect basic ethical principles and avoid harming individuals or groups. This involves detecting and mitigating biases in their responses so as not to perpetuate stereotypes or discrimination. It also requires filtering toxic or offensive content that could harm users. Equity is assessed by ensuring that the system offers consistent treatment to different demographics, without unduly favouring or excluding anyone.
  • Security and robustness: here we refer to both user safety (that the system does not generate dangerous recommendations or facilitate illicit activities) and technical robustness against errors and manipulations. A safe model must avoid instructions that lead, for example, to illegal behavior, reliably rejecting those requests. In addition, robustness means that the system can withstand adversarial attacks (such as requests designed to deceive you) and that it operates stably under different conditions.
  • Consistency and reliability: Generative AI results must be coherent, consistent, and correct. In applications such as medical diagnosis or legal assistance, it is not enough for the answer to sound convincing; it must be true and accurate. For this reason, aspects such as the logical coherence of the answers, their relevance with respect to the question asked and the factual accuracy of the information are validated. Its stability over time is also checked (so that two similar requests under the same conditions yield equivalent results), as well as its resilience (so that small changes in the input do not cause substantially different outputs).
  • Transparency and explainability: To trust the decisions of an AI-based system, it is desirable to understand how and why it produces them. Transparency includes providing information about training data, known limitations, and model performance across different tests. Many companies are adopting the practice of publishing "model cards," which summarize how a system was designed and evaluated, including bias metrics, common errors, and recommended use cases. Explainability goes a step further and seeks to ensure that the model offers, when possible, understandable explanations of its results (for example, highlighting which data influenced a certain recommendation). Greater transparency and explainability increase accountability, allowing developers and third parties to audit the behavior of the system.
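As a toy illustration of the consistency dimension, the sketch below compares answers to paraphrased prompts. The generate() function is a stub standing in for whatever model or API is under test, and lexical similarity is only a crude proxy: a real evaluation would use semantic similarity or task-specific metrics.

```python
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    # Stub standing in for the generative system under test;
    # replace with a real call to your model or API before use.
    return "To renew the document you must book an appointment and bring a recent photo."

def consistency_score(prompt: str, paraphrase: str) -> float:
    """Stability check: semantically equivalent prompts should yield similar answers.
    Returns a 0-1 lexical similarity score (a crude proxy for real metrics)."""
    return SequenceMatcher(None, generate(prompt), generate(paraphrase)).ratio()

# Each test case pairs a prompt with a paraphrase of the same question (illustrative)
test_cases = [
    ("What documents do I need to renew my ID card?",
     "Which paperwork is required to renew an identity card?"),
    ("How do I request a copy of my medical record?",
     "What is the procedure to obtain my health records?"),
]

for prompt, paraphrase in test_cases:
    print(f"Consistency {consistency_score(prompt, paraphrase):.2f} -> {prompt}")
```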

Open data: transparency and more diverse evidence

Properly validating AI models and systems, particularly in terms of fairness and robustness, requires representative and diverse datasets that reflect the reality of different populations and scenarios.

On the other hand, if only the companies that own a system have data to test it, we have to rely on their own internal evaluations. However, when open datasets and public testing standards exist, the community (universities, regulators, independent developers, etc.) can test the systems autonomously, thus functioning as an independent counterweight that serves the interests of society.

A concrete example was given by Meta (Facebook) when it released its Casual Conversations v2 dataset in 2023. It is an open dataset, obtained with informed consent, that collects videos from people from 7 countries (Brazil, India, Indonesia, Mexico, Vietnam, the Philippines and the USA), with 5,567 participants who provided attributes such as age, gender, language and skin tone.

Meta's objective with the publication was precisely to make it easier for researchers to evaluate the impartiality and robustness of AI systems in vision and voice recognition. By expanding the geographic provenance of the data beyond the U.S., this resource allows you to check if, for example, a facial recognition model works equally well with faces of different ethnicities, or if a voice assistant understands accents from different regions.

The diversity that open data brings also helps to uncover neglected areas in AI assessment. Researchers from Stanford's Institute for Human-Centered Artificial Intelligence (HAI) showed in the HELM (Holistic Evaluation of Language Models) project that many language models are not evaluated in minority dialects of English or in underrepresented languages, simply because there is no quality data in the most well-known benchmarks.

The community can identify these gaps and create new test sets to fill them (e.g., an open dataset of FAQs in Swahili to validate the behavior of a multilingual chatbot). In this sense, HELM has incorporated broader evaluations precisely thanks to the availability of open data, making it possible to measure not only the performance of the models in common tasks, but also their behavior in other linguistic, cultural and social contexts. This has helped to make visible the current limitations of the models and to promote the development of systems that are more inclusive and representative of the real world, as well as models better adapted to the specific needs of local contexts, as is the case of the ALIA foundational model developed in Spain.

In short, open data contributes to democratizing the ability to audit AI systems, preventing the power of validation from residing in only a few hands. It reduces costs and barriers, as a small development team can test its model with open datasets without having to invest great effort in collecting its own data. This not only fosters innovation, but also ensures that local AI solutions from small businesses are subject to the same rigorous validation standards.

The validation of applications based on generative AI is today an unquestionable necessity to ensure that these tools operate in tune with our values and expectations. It is not a trivial process: it requires new methodologies, innovative metrics and, above all, a culture of responsibility around AI. But the benefits are clear: a rigorously validated AI system will be more trustworthy, both for the individual user who, for example, interacts with a chatbot without fear of receiving a toxic response, and for society as a whole, which can accept decisions based on these technologies knowing that they have been properly audited. And open data helps to cement this trust by fostering transparency, enriching evidence with diversity, and involving the entire community in the validation of AI systems.


Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization. The contents and views reflected in this publication are the sole responsibility of the author.

Blog

Data is a fundamental resource for improving our quality of life because it enables better decision-making processes to create personalised products and services, both in the public and private sectors. In contexts such as health, mobility, energy or education, the use of data facilitates more efficient solutions adapted to people's real needs. However, in working with data, privacy plays a key role. In this post, we will look at how data spaces, the federated computing paradigm and federated learning, one of its most powerful applications, provide a balanced solution for harnessing the potential of data without compromising privacy. In addition, we will highlight how federated learning can also be used with open data to enhance its reuse in a collaborative, incremental and efficient way.

Privacy, a key issue in data management

As mentioned above, the intensive use of data requires increasing attention to privacy. For example, in eHealth, secondary misuse of electronic health record data could violate patients' fundamental rights. One effective way to preserve privacy is through data ecosystems that prioritise data sovereignty, such as data spaces. A data space is a federated data management system that allows data to be exchanged reliably between providers and consumers. In addition, the data space ensures the interoperability of data to create products and services that create value. In a data space, each provider maintains its own governance rules, retaining control over its data (i.e. sovereignty over its data), while enabling its re-use by consumers. This implies that each provider should be able to decide what data it shares, with whom and under what conditions, ensuring compliance with its interests and legal obligations.

Federated computing and data spaces

Data spaces represent an evolution in data management, related to a paradigm called federated computing, where data is reused without the need for it to flow from data providers to consumers. In federated computing, providers transform their data into privacy-preserving intermediate results so that they can be sent to data consumers. In addition, this enables other privacy-enhancing technologies (PETs) to be applied. Federated computing aligns perfectly with reference architectures such as Gaia-X and its Trust Framework, which sets out the principles and requirements to ensure secure, transparent and rule-compliant data exchange between data providers and data consumers.

Federated learning

One of the most powerful applications of federated computing is federated machine learning (federated learning), an artificial intelligence technique that allows models to be trained without centralising data. That is, instead of sending the data to a central server for processing, what is sent are the models trained locally by each participant.

These models are then combined centrally to create a global model. As an example, imagine a consortium of hospitals that wants to develop a predictive model to detect a rare disease. Every hospital holds sensitive patient data, and open sharing of this data is not feasible due to privacy concerns (including other legal or ethical issues). With federated learning, each hospital trains the model locally with its own data, and only shares the model parameters (training results) centrally. Thus, the final model leverages the diversity of data from all hospitals without compromising the individual privacy and data governance rules of each hospital.

Training in federated learning usually follows an iterative cycle:

  1. A central server starts a base model and sends it to each of the participating distributed nodes.
  2. Each node trains the model locally with its data.
  3. Nodes return only the parameters of the updated model, not the data (i.e. data shuttling is avoided).
  4. The central server aggregates parameter updates, training results at each node and updates the global model.
  5. The cycle is repeated until a sufficiently accurate model is achieved.


Figure 1. Visual representing the federated learning training process. Own elaboration

This approach is compatible with various machine learning algorithms, including deep neural networks, regression models, classifiers, etc.
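To make the cycle above more concrete, here is a minimal sketch of federated averaging (FedAvg) with a toy linear model and synthetic data. The node sizes, learning rate and number of rounds are illustrative assumptions; real deployments rely on frameworks that add secure aggregation, differential privacy and fault tolerance.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=20):
    """A node trains a linear model locally with gradient descent.
    Only the updated weights leave the node, never the raw data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(node_weights, node_sizes):
    """The central server aggregates local models, weighting by dataset size (FedAvg)."""
    total = sum(node_sizes)
    return sum(w * (n / total) for w, n in zip(node_weights, node_sizes))

# Toy federation: three nodes, each holding private data following y = 3x + noise
rng = np.random.default_rng(0)
nodes = []
for size in (100, 60, 40):
    X = rng.normal(size=(size, 1))
    y = 3 * X[:, 0] + rng.normal(0, 0.1, size)
    nodes.append((X, y))

global_w = np.zeros(1)                                   # 1. server initialises the base model
for _ in range(5):                                       # repeat the training cycle
    local_ws = [local_train(global_w, X, y) for X, y in nodes]          # 2-3. local training
    global_w = federated_average(local_ws, [len(y) for _, y in nodes])  # 4. aggregation

print(f"Global coefficient after 5 rounds: {global_w[0]:.3f} (true value: 3)")
```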

Benefits and challenges of federated learning

Federated learning offers multiple benefits by avoiding data movement. The most notable are listed below:

  1. Privacy and compliance: by remaining at source, data exposure risks are significantly reduced and compliance with regulations such as the General Data Protection Regulation (GDPR) is facilitated.
  2. Data sovereignty: Each entity retains full control over its data, which avoids competitive conflicts.
  3. Efficiency: avoids the cost and complexity of exchanging large volumes of data, speeding up processing and development times.
  4. Trust: facilitates frictionless collaboration between organisations.

There are several use cases in which federated learning is necessary, for example:

  • Health: Hospitals and research centres can collaborate on predictive models without sharing patient data.
  • Finance: banks and insurers can build fraud detection or risk-sharing analysis models, while respecting the confidentiality of their customers.
  • Smart tourism: tourist destinations can analyse visitor flows or consumption patterns without the need to unify the databases of their stakeholders (both public and private).
  • Industry: Companies in the same industry can train models for predictive maintenance or operational efficiency without revealing competitive data.

While its benefits are clear in a variety of use cases, federated learning also presents technical and organisational challenges:

  • Data heterogeneity: local data may have different formats or structures, making joint training difficult. In addition, this data may change over time, which adds further difficulty.
  • Unbalanced data: Some nodes may have more or higher quality data than others, which may skew the overall model.
  • Local computational costs: Each node needs sufficient resources to train the model locally.
  • Synchronisation: the training cycle requires good coordination between nodes to avoid latency or errors.

Beyond federated learning

Although the most prominent application of federated computing is federated learning, other applications for data management are emerging, such as federated analytics. Federated data analysis allows statistical and descriptive analyses to be performed on distributed data without moving it to the consumers: each provider performs the required statistical calculations locally and only shares the aggregated results with the consumer, according to the consumer's requirements and permissions (a minimal sketch of this idea is shown after the table). The following table shows the differences between federated learning and federated data analysis.

 

| Criteria | Federated learning | Federated data analysis |
|---|---|---|
| Objective | Prediction and training of machine learning models. | Descriptive analysis and calculation of statistics. |
| Task type | Predictive tasks (e.g. classification or regression). | Descriptive tasks (e.g. means or correlations). |
| Example | Training disease-diagnosis models using medical images from several hospitals. | Calculating health indicators for a health area without moving data between hospitals. |
| Expected output | Trained global model. | Aggregated statistical results. |
| Nature | Iterative. | Direct. |
| Computational complexity | High. | Medium. |
| Privacy and sovereignty | High. | Medium. |
| Algorithms | Machine learning. | Statistical algorithms. |

Table 1. Comparison between federated learning and federated data analysis. Source: own elaboration.
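To see why the table describes federated data analysis as "direct", here is a minimal Python sketch in which a global mean is computed in a single round: each provider shares only its local sum and count, never the underlying records. The node data is invented for the example.

```python
import numpy as np

# Invented local datasets: e.g. waiting times recorded by three hospitals
local_data = [np.array([12.0, 15.0, 9.0]),
              np.array([20.0, 18.0]),
              np.array([11.0, 14.0, 13.0, 10.0])]

# Each provider computes its aggregates locally and shares only these
local_aggregates = [(values.sum(), len(values)) for values in local_data]

# The consumer combines the aggregates without ever seeing the raw data
total_sum = sum(s for s, _ in local_aggregates)
total_count = sum(n for _, n in local_aggregates)
print("Federated mean:", total_sum / total_count)
```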

Federated learning and open data: a symbiosis to be explored

In principle, open data resolves privacy issues prior to publication, so one might think that federated learning techniques are unnecessary here. Nothing could be further from the truth: federated learning can bring significant advantages to the management and exploitation of open data. The first aspect to highlight is that open data portals such as datos.gob.es or data.europa.eu are themselves federated environments. In these portals, applying federated learning to large datasets would allow models to be trained directly at source, avoiding transfer and processing costs. Federated learning would also make it easier to combine open data with other, sensitive data without compromising the privacy of the latter. Finally, many types of open data are highly dynamic (traffic data, for example), so federated learning would enable incremental training that automatically takes new updates of open datasets into account as they are published, without having to restart costly training processes.

Federated learning, the basis for privacy-friendly AI

Federated machine learning represents a necessary evolution in the way we develop artificial intelligence services, especially in contexts where data is sensitive or distributed across multiple providers. Its natural alignment with the concept of the data space makes it a key technology to drive innovation based on data sharing, while respecting privacy and maintaining data sovereignty.

As regulation (such as the European Health Data Space Regulation) and data space infrastructures evolve, federated learning, and other types of federated computing, will play an increasingly important role in data sharing, maximising the value of data, but without compromising privacy. Finally, it is worth noting that, far from being unnecessary, federated learning can become a strategic ally to improve the efficiency, governance and impact of open data ecosystems.


Jose Norberto Mazón, Professor of Computer Languages and Systems at the University of Alicante. The contents and views reflected in this publication are the sole responsibility of the author.

Documentation

In the current landscape of data analysis and artificial intelligence, the automatic generation of comprehensive and coherent reports represents a significant challenge. While traditional tools allow for data visualization or generating isolated statistics, there is a need for systems that can investigate a topic in depth, gather information from diverse sources, and synthesize findings into a structured and coherent report.

In this practical exercise, we will explore the development of a report generation agent based on LangGraph and artificial intelligence. Unlike traditional approaches based on templates or predefined statistical analysis, our solution leverages the latest advances in language models to:

  1. Create virtual teams of analysts specialized in different aspects of a topic.
  2. Conduct simulated interviews to gather detailed information.
  3. Synthesize the findings into a coherent and well-structured report.

Access the data laboratory repository on Github.

Run the data preprocessing code on Google Colab.

As shown in Figure 1, the complete agent flow follows a logical sequence that goes from the initial generation of questions to the final drafting of the report.

Figure 1. Agent flow diagram.

Application Architecture

The core of the application is based on a modular design implemented as an interconnected state graph, where each module represents a specific functionality in the report generation process. This structure allows for a flexible workflow, recursive when necessary, and with capacity for human intervention at strategic points.
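The sketch below illustrates, in simplified form, how such a state graph can be declared with LangGraph. The state fields, node names and placeholder functions are ours, not the exercise's actual code (which is available in the GitHub repository); the point is only to show how nodes, edges and a human-review interruption fit together.

```python
from typing import List, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class ReportState(TypedDict):
    topic: str
    analysts: List[str]
    interviews: List[str]
    report: str

def create_analysts(state: ReportState) -> dict:
    # Placeholder: an LLM call would generate the analyst profiles here
    return {"analysts": [f"Analyst specialised in {state['topic']}"]}

def conduct_interviews(state: ReportState) -> dict:
    # Placeholder: each analyst would run a simulated interview here
    return {"interviews": [f"Interview notes by {a}" for a in state["analysts"]]}

def write_report(state: ReportState) -> dict:
    # Placeholder: sections, introduction and conclusion would be drafted here
    return {"report": "\n\n".join(state["interviews"])}

builder = StateGraph(ReportState)
builder.add_node("create_analysts", create_analysts)
builder.add_node("conduct_interviews", conduct_interviews)
builder.add_node("write_report", write_report)
builder.add_edge(START, "create_analysts")
builder.add_edge("create_analysts", "conduct_interviews")
builder.add_edge("conduct_interviews", "write_report")
builder.add_edge("write_report", END)

# Pausing before the interviews lets a human review and refine the profiles
graph = builder.compile(checkpointer=MemorySaver(),
                        interrupt_before=["conduct_interviews"])
```

With the checkpointer in place, the graph can be invoked with a thread identifier, paused at the interruption point and resumed once the analyst profiles have been approved or adjusted.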

Main Components

The system consists of three fundamental modules that work together:

1. Virtual Analysts Generator

This component creates a diverse team of virtual analysts specialized in different aspects of the topic to be investigated. The flow includes:

  • Initial creation of profiles based on the research topic.
  • Human feedback point that allows reviewing and refining the generated profiles.
  • Optional regeneration of analysts incorporating suggestions.

This approach ensures that the final report includes diverse and complementary perspectives, enriching the analysis.

2. Interview System

Once the analysts are generated, each one participates in a simulated interview process that includes:

  • Generation of relevant questions based on the analyst's profile.
  • Information search in sources via Tavily Search and Wikipedia.
  • Generation of informative responses combining the obtained information.
  • Automatic decision on whether to continue or end the interview based on the information gathered (illustrated in the sketch after this list).
  • Storage of the transcript for subsequent processing.
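The continue-or-end decision is typically expressed as a routing function attached to a conditional edge. The following sketch is a simplified, hypothetical version of such a function (the state fields, node names and stopping phrase are assumptions, not the repository's actual code):

```python
from typing import List, TypedDict

class InterviewState(TypedDict):
    messages: List[str]   # alternating analyst questions ("Q: ...") and expert answers
    max_turns: int        # hard limit on the number of questions

def route_interview(state: InterviewState) -> str:
    """Routing function for a conditional edge: decide whether the simulated
    interview continues or the transcript is saved."""
    questions = [m for m in state["messages"] if m.startswith("Q:")]
    last = state["messages"][-1] if state["messages"] else ""
    if len(questions) >= state["max_turns"] or "thank you so much" in last.lower():
        return "save_transcript"   # enough information has been gathered
    return "ask_question"          # keep interviewing

# In the graph, this function would be wired in with something like:
# builder.add_conditional_edges("answer_question", route_interview,
#                               ["ask_question", "save_transcript"])
```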

The interview system represents the heart of the agent, where the information that will nourish the final report is obtained. As shown in Figure 2, this process can be monitored in real time through LangSmith, an observability platform that allows each step of the flow to be tracked.

Figure 2. System monitoring via LangSmith: a concrete example of an analyst-interviewer interaction.

3. Report Generator

Finally, the system processes the interviews to create a coherent report through:

  • Writing individual sections based on each interview.
  • Creating an introduction that presents the topic and structure of the report.
  • Organizing the main content that integrates all sections.
  • Generating a conclusion that synthesizes the main findings.
  • Consolidating all sources used.

Figure 3 shows an example of the report resulting from the complete process, demonstrating the quality and structure of the final document generated automatically.

Figure 3. Report generated automatically from the prompt "Open data in Spain".

What can you learn?

This practical exercise allows you to learn:

Integration of advanced AI in information processing systems:

  • How to communicate effectively with language models.
  • Techniques to structure prompts that generate coherent and useful responses.
  • Strategies to simulate virtual teams of experts.

Development with LangGraph:

  • Creation of state graphs to model complex flows.
  • Implementation of conditional decision points.
  • Design of systems with human intervention at strategic points.

Parallel processing with LLMs:

  • Parallelization techniques for tasks with language models (see the sketch after this list).
  • Coordination of multiple independent subprocesses.
  • Methods for consolidating scattered information.
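As a minimal illustration of the parallelization idea, independent of LangGraph's own mechanisms, the following sketch runs several interview subprocesses concurrently and then consolidates the results. Here `run_interview` is just a stand-in for the real LLM-backed step.

```python
from concurrent.futures import ThreadPoolExecutor

def run_interview(analyst: str) -> str:
    # Stand-in for the real step: questions, web searches and answers via an LLM
    return f"Findings gathered by {analyst}"

analysts = ["Policy analyst", "Technical analyst", "Economic analyst"]

# LLM calls are I/O-bound, so threads are usually enough to overlap them
with ThreadPoolExecutor(max_workers=len(analysts)) as pool:
    sections = list(pool.map(run_interview, analysts))

report_body = "\n\n".join(sections)   # consolidation of the scattered results
print(report_body)
```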

Good design practices:

  • Modular structuring of complex systems.
  • Error handling and retries.
  • Tracking and debugging workflows through LangSmith.

Conclusions and future

This exercise demonstrates the extraordinary potential of artificial intelligence as a bridge between data and end users. Through the practical case developed, we can observe how the combination of advanced language models with flexible architectures based on graphs opens new possibilities for automatic report generation.

The ability to simulate virtual expert teams, perform parallel research and synthesize findings into coherent documents, represents a significant step towards the democratization of analysis of complex information.

For those interested in expanding the capabilities of the system, there are multiple promising directions for its evolution:

  • Incorporation of automatic data verification mechanisms to ensure accuracy.
  • Implementation of multimodal capabilities that allow incorporating images and visualizations.
  • Integration with more sources of information and knowledge bases.
  • Development of more intuitive user interfaces for human intervention.
  • Expansion to specialized domains such as medicine, law or sciences.

In summary, this exercise not only demonstrates the feasibility of automating the generation of complex reports through artificial intelligence, but also points to a promising path towards a future where deep analysis of any topic is within everyone's reach, regardless of their level of technical experience. The combination of advanced language models, graph architectures and parallelization techniques opens a range of possibilities to transform the way we generate and consume information.

Documentation

The Spanish Data Protection Agency (AEPD) has recently published the Spanish translation of the Guide on Synthetic Data Generation, originally produced by the Data Protection Authority of Singapore. The document provides technical and practical guidance for data protection officers and data managers on how to implement this technology, which makes it possible to simulate real data while maintaining its statistical characteristics without compromising personal information.

The guide highlights how synthetic data can drive the data economy, accelerate innovation and mitigate risks in security breaches. To this end, it presents case studies, recommendations and best practices aimed at reducing the risks of re-identification. In this post, we analyse the key aspects of the Guide highlighting main use cases and examples of practical application.

What are synthetic data? Concept and benefits

Synthetic data is artificial data generated using mathematical models specifically designed for artificial intelligence (AI) or machine learning (ML) systems. This data is created by training a model on a source dataset to imitate its characteristics and structure, but without exactly replicating the original records.

High-quality synthetic data retain the statistical properties and patterns of the original data. They therefore allow for analyses that produce results similar to those that would be obtained with real data. However, being artificial, they significantly reduce the risks associated with the exposure of sensitive or personal information.

For more information on this topic, you can read the monographic report Synthetic data: what are they and what are they used for?, which provides detailed information on the theoretical foundations, methodologies and practical applications of this technology.

The implementation of synthetic data offers multiple advantages for organisations, for example:

  • Privacy protection: allow data analysis while maintaining the confidentiality of personal or commercially sensitive information.
  • Regulatory compliance: make it easier to follow data protection regulations while maximising the value of information assets.
  • Risk reduction: minimise the chances of data breaches and their consequences.
  • Driving innovation: accelerate the development of data-driven solutions without compromising privacy.
  • Enhanced collaboration: Enable valuable information to be shared securely across organisations and departments.

Steps to generate synthetic data

To properly implement this technology, the Guide on Synthetic Data Generation recommends following a structured five-step approach:

  1. Know the data: clearly understand the purpose of the synthetic data and the characteristics of the source data to be preserved, setting precise targets for the acceptable risk threshold and the expected utility.
  2. Prepare the data: identify the key insights to be retained, select relevant attributes, remove or pseudonymise direct identifiers, and standardise formats and structures in a well-documented data dictionary.
  3. Generate synthetic data: select the most appropriate methods according to the use case, assess quality through completeness, fidelity and usability checks, and iteratively adjust the process to achieve the desired balance (a simplified sketch of this step is shown after the list).
  4. Assess re-identification risks: apply attack-based techniques to determine whether information about individuals, or their membership of the original dataset, can be inferred, ensuring that risk levels are acceptable.
  5. Manage residual risks: implement technical, governance and contractual controls to mitigate the identified risks, properly documenting the entire process.
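As a highly simplified illustration of steps 3 and 4, the Python sketch below fits a multivariate normal distribution to a small numeric dataset, samples synthetic records from it and runs a crude fidelity and proximity check. The data, thresholds and modelling choice are invented for the example; real projects would use dedicated synthesizers (copula- or GAN-based, for instance) and proper attack-based re-identification testing.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)

# Invented source data: numeric attributes after direct identifiers were removed
real = pd.DataFrame({
    "age": rng.normal(45, 12, 500).round(),
    "income": rng.normal(30000, 8000, 500).round(2),
})

# Step 3 (simplified): model the joint distribution and sample artificial records
mean = real.mean().to_numpy()
cov = np.cov(real.to_numpy(), rowvar=False)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=len(real)),
                         columns=real.columns)

# Fidelity check: basic statistics should be similar between real and synthetic data
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])

# Step 4 (very crude proxy): flag synthetic rows suspiciously close to a real
# record, which could hint at memorisation and re-identification risk
distances = cdist(synthetic.to_numpy(), real.to_numpy()).min(axis=1)
print("Synthetic rows nearly identical to a real record:", int((distances < 1e-6).sum()))
```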

Practical applications and success stories

To realise all these benefits, synthetic data can be applied in a variety of scenarios that respond to specific organisational needs. The Guide mentions, for example:

1. Generation of datasets for training AI/ML models: synthetic data addresses the scarcity of labelled data for training AI models. Where real data are limited, synthetic data can be a cost-effective alternative. In addition, they make it possible to simulate extraordinary events or to increase the representation of minority groups in training sets, an interesting application to improve the performance and representativeness of AI models across all social groups.

2. Data analysis and collaboration: this type of data facilitates the exchange of information for analysis, especially in sectors such as health, where the original data is particularly sensitive. In this sector, as in others, synthetic data gives stakeholders a representative sample of the actual data without exposing confidential information, allowing them to assess its quality and potential before formal agreements are made.

3. Software testing: synthetic data is very useful for system development and software testing because it allows realistic, but not real, data to be used in development environments, thus avoiding possible personal data breaches if the development environment is compromised.

The practical application of synthetic data is already showing positive results in various sectors:

I. Financial sector: fraud detection. J.P. Morgan has successfully used synthetic data to train fraud detection models, creating datasets with a higher percentage of fraudulent cases that significantly improved the models' ability to identify anomalous behaviour.

II. Technology sector: research on AI bias. Mastercard collaborated with researchers to develop methods to test for bias in AI using synthetic data that maintained the true relationships of the original data, but were private enough to be shared with outside researchers, enabling advances that would not have been possible without this technology.

III. Health sector: safeguarding patient data. Johnson & Johnson implemented AI-generated synthetic data as an alternative to traditional anonymisation techniques to process healthcare data, achieving a significant improvement in the quality of analysis by effectively representing the target population while protecting patients' privacy.

The balance between utility and protection

It is important to note that synthetic data are not inherently risk-free. The similarity to the original data could, in certain circumstances, allow information about individuals or sensitive data to be leaked. It is therefore crucial to strike a balance between data utility and data protection.

This balance can be achieved by implementing good practices during the process of generating synthetic data, incorporating protective measures such as:

  • Adequate data preparation: removal of outliers, pseudonymisation of direct identifiers and generalisation of granular data.
  • Re-identification risk assessment: analysis of the possibility that synthetic data can be linked to real individuals.
  • Implementation of technical controls: adding noise to data, reducing granularity or applying differential privacy techniques (a minimal example follows this list).
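As a minimal illustration of the "adding noise" control, the following Python sketch applies the Laplace mechanism to a count before releasing it. The data, the epsilon value and the sensitivity are invented for the example; a real project should rely on an audited differential privacy library rather than this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(7)

ages = rng.integers(18, 90, size=1000)   # invented source attribute
true_count = int((ages > 65).sum())      # sensitive aggregate to be released

epsilon = 1.0        # privacy budget: smaller values add more noise
sensitivity = 1      # a count changes by at most 1 if one person is added or removed
noisy_count = true_count + rng.laplace(scale=sensitivity / epsilon)

print("True count:", true_count, "| Released count:", round(noisy_count))
```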

Synthetic data represents an exceptional opportunity to drive data-driven innovation while respecting privacy and complying with data protection regulations. Its ability to generate statistically representative but artificial information makes it a versatile tool for multiple applications, from AI model training to inter-organisational collaboration and software development.

By properly implementing the best practices and controls described in the Guide on Synthetic Data Generation translated by the AEPD, organisations can reap the benefits of synthetic data while minimising the associated risks, positioning themselves at the forefront of responsible digital transformation. The adoption of privacy-enhancing technologies such as synthetic data is not only a defensive measure, but a proactive step towards an organisational culture that values both innovation and data protection, which are critical to success in the digital economy of the future.

Blog

The evolution of generative AI has been dizzying: from the first large language models that impressed us with their ability to read and write like a human, through the advanced RAG (Retrieval-Augmented Generation) techniques that markedly improved the quality of the responses provided, and the emergence of intelligent agents, to an innovation that redefines our relationship with technology: Computer use.

At the end of April 2020, just one month after the start of an unprecedented period of worldwide home confinement due to the COVID-19 pandemic, we wrote on datos.gob.es about the large GPT-2 and GPT-3 language models. OpenAI, founded in 2015, had presented almost a year earlier (February 2019) a new language model able to generate written text virtually indistinguishable from that created by a human. GPT-2 had been trained on a corpus (a set of texts prepared to train language models) of about 40 GB (gigabytes), roughly 8 million web pages, while the latest family of models based on GPT-4 is estimated to have been trained on corpora of terabyte (TB) scale; around a thousand times more.

In this context, it is important to talk about two concepts:

  • LLMs (Large Language Models): large-scale language models, trained on vast amounts of data and capable of performing a wide range of linguistic tasks. Today we have countless tools based on these LLMs that, depending on their area of specialisation, can generate programming code, ultra-realistic images and videos, or solve complex mathematical problems. All major companies and organisations in the digital technology sector have set out to integrate these tools into their software and hardware products, developing use cases that solve or optimise specific tasks and activities which previously required a high degree of human intervention.
  • Agents: the user experience with artificial intelligence models is becoming increasingly complete, so that we can ask our interface not only to answer our questions, but also to perform complex tasks that require integration with other IT tools. For example, we not only ask a chatbot for information on the best restaurants in the area, but also ask it to check table availability for specific dates and make a reservation for us. This extended user experience is what artificial intelligence agents provide. Built on large language models, these agents are able to interact with the world outside the model and "talk" to other services via APIs and programming interfaces prepared for that purpose.

Computer use

However, the ability of agents to perform actions autonomously depends on two key elements: on the one hand, their concrete programming, that is, the functionality that has been programmed or configured for them; on the other hand, the need for all other programs to be ready to "talk" to these agents. In other words, their programming interfaces must be prepared to receive instructions from these agents. For example, a restaurant reservation application has to be prepared to receive not only forms filled in by a human, but also requests made by an agent that has previously been invoked by a human using natural language. This imposes a limitation on the set of activities and tasks that we can automate from a conversational interface: the interface can provide us with almost infinite answers to the questions we ask it, but its ability to interact with the outside world is severely limited because most other computer applications are not prepared for it.

This is where Computer use comes in. With the arrival of the Claude 3.5 Sonnet model, Anthropic has introduced Computer use, a beta capability that allows AI to interact directly with graphical user interfaces.

How does Computer use work?

Claude can move your computer's cursor as if it were you, click buttons and type text, emulating the way humans operate a computer. The best way to understand how Computer use works in practice is to see it in action: below is a link to the specific Computer use section of Anthropic's YouTube channel.

Figure 1. Screenshot from Anthropic's YouTube channel, Computer use specific section.

Would you like to try it?

If you've made it this far, you shouldn't miss the chance to try it with your own hands.

Here is a simple guide to testing Computer use in an isolated environment. It is important to take into account the security recommendations that Anthropic sets out in its Computer use guidelines. This feature of the Claude Sonnet model can perform actions on a computer, which can be potentially dangerous, so it is advisable to carefully review the Computer use security warning.

All official developer documentation can be found in Anthropic's official GitHub repository. In this post, we have chosen to run Computer use in a Docker container environment, which is the easiest and safest way to test it. If you don't already have Docker, you can follow the simple official guidelines to install it on your system beforehand.

To reproduce this test, we propose following this script step by step:

  1. Anthropic API Key. To interact with Claude Sonnet you need an Anthropic account, which you can create for free here. Once inside, go to the API Keys section and create a new key for your test.
  2. Once you have your API key, run the following command in your terminal, substituting your key where it says "%your_api_key%":
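As a reference, the command published in Anthropic's computer-use quickstart looked like the one below at the time of writing; please check the repository's README for the current image tag, exposed ports and options before running it.

```bash
docker run \
    -e ANTHROPIC_API_KEY=%your_api_key% \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 -p 8501:8501 -p 6080:6080 -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
```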

  3. If everything went well, you will see the container's startup messages in your terminal. Now you just have to open your browser and type this URL in the navigation bar: http://localhost:8080/.

You will see your interface open:

Figure 2. Computer use interface.

You can now explore how Computer use works for yourself.

We suggest you start small: for example, ask it to open a browser and search for something, or to give you information about your computer or operating system. Gradually, you can ask for more complex tasks. We have tested the following prompt and, after several attempts, managed to get Computer use to complete the full task:

Open a browser, navigate to the datos.gob.es catalogue, use the search engine to locate a dataset on: Public security. Traffic accidents. 2014; Locate the file in csv format; download and open it with free Office.

Potential uses in data platforms such as datos.gob.es

In view of this first experimental version of Computer use, it seems that the potential of the tool is very high. We can imagine how many more things we can do thanks to this tool. Here are some ideas:

  • We could ask the system to perform a complete search of datasets related to a specific topic and summarise the main results in a document. For example, if we were writing an article on traffic data in Spain, we could obtain, unattended, a list of the main open traffic datasets in the datos.gob.es catalogue.
  • In the same way, we could request a summary not of datasets but of other platform items.
  • A slightly more sophisticated example would be to ask Claude, through the Computer use conversational interface, to make a series of calls to the datos.gob.es API to obtain information about certain datasets programmatically. To do this, we open a browser and log into an application such as Postman (remember that Computer use is in experimental mode and we should not enter sensitive data such as user credentials on web pages). We can then ask it to look up information about the datos.gob.es API and execute an HTTP call, taking advantage of the fact that this API does not require authentication.

Through these simple examples, we hope to have introduced you to a new application of generative AI and to the paradigm shift that this new capability represents. If the machine is able to emulate the use of a computer as we humans do, new and previously unimaginable opportunities will open up in the coming months.


Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation. The contents and points of view reflected in this publication are the sole responsibility of the author.
