Blog

Language models are at the epicentre of the technological paradigm shift that has been taking place in generative artificial intelligence (AI) over the last two years. From the tools with which we interact in natural language to generate text, images or videos and which we use to create creative content, design prototypes or produce educational material, to more complex applications in research and development that were even instrumental in winning the 2024 Nobel Prize in Chemistry, language models are proving their usefulness in a wide variety of applications that we are still exploring.

Since Google's influential 2017 paper "Attention is all you need", which described the Transformer architecture underpinning the new capabilities that OpenAI popularised in late 2022 with the launch of ChatGPT, the evolution of language models has been dizzying. In just two years, we have moved from models focused solely on text generation to multimodal versions that integrate interaction and generation of text, images and audio.

This rapid evolution has given rise to two categories of language models: SLMs (Small Language Models), which are lighter and more efficient, and LLMs (Large Language Models), which are heavier and more powerful. Far from considering them as competitors, we should analyse SLMs and LLMs as complementary technologies. While LLMs offer general processing and content generation capabilities, SLMs can provide support for more agile and specialised solutions for specific needs. However, both share one essential element: they rely on large volumes of data for training, and at the heart of their capabilities is open data, which is part of the fuel used to train the language models on which generative AI applications are based.

LLM: power driven by massive data

LLMs are large-scale language models with billions, even trillions, of parameters. These parameters are the mathematical units that allow the model to identify and learn patterns in the training data, giving it an extraordinary ability to generate text (or other formats) that is coherent and adapted to the user's context. These models, such as the GPT family from OpenAI, Gemini from Google or Llama from Meta, are trained on immense volumes of data and are capable of performing complex tasks, including some for which they were not explicitly trained.

Thus, LLMs are able to perform tasks such as generating original content, answering questions with relevant and well-structured information or producing software code, often with a level of competence comparable to or higher than that of humans specialised in these tasks, all while maintaining complex and fluent conversations.

LLMs rely on massive amounts of data to achieve their current level of performance: from repositories such as Common Crawl, which collects data from millions of web pages, to structured sources such as Wikipedia or specialised sets such as PubMed Open Access in the biomedical field. Without access to these massive bodies of open data, the ability of these models to generalise and adapt to multiple tasks would be much more limited.

However, as LLMs continue to evolve, the need for open data increases to achieve specific advances such as:

  1. Increased linguistic and cultural diversity: although today's LLMs are multilingual, they are generally dominated by data in English and other major languages. The lack of open data in other languages limits the ability of these models to be truly inclusive and diverse. More open data in diverse languages would ensure that LLMs can be useful to all communities, while preserving the world's cultural and linguistic richness.
  2. Reduction of bias: LLMs, like any AI model, are prone to reflecting the biases present in the data they are trained on. This sometimes leads to responses that perpetuate stereotypes or inequalities. Incorporating more carefully selected open data, especially from sources that promote diversity and equality, is fundamental to building models that fairly and equitably represent different social groups.
  3. Constant updating: data on the web and other open resources is constantly changing. Without access to up-to-date data, LLMs quickly generate outdated responses. Therefore, increasing the availability of fresh and relevant open data would allow LLMs to keep up with current events.
  4. More accessible training: as LLMs grow in size and capacity, so does the cost of training and fine-tuning them. Open data allows independent developers, universities and small businesses to train and refine their own models without the need for costly data acquisitions. This democratises access to artificial intelligence and fosters global innovation.

To address some of these challenges, the new Artificial Intelligence Strategy 2024 includes measures aimed at generating models and corpora in Spanish and co-official languages, including the development of evaluation datasets that consider ethical evaluation.

SLM: optimised efficiency with specific data

On the other hand, SLMs have emerged as an efficient and specialised alternative that uses a smaller number of parameters (usually in the millions) and is designed to be lightweight and fast. Although they do not match the versatility and competence of LLMs in complex tasks, SLMs stand out for their computational efficiency, speed of deployment and ability to specialise in specific domains.

SLMs also rely on open data, but in this case the quality and relevance of the datasets matter more than their volume, so the challenges they face are more related to data cleaning and specialisation. These models require sets that are carefully selected and tailored to the specific domain in which they will be used, as any errors, biases or lack of representativeness in the data can have a much greater impact on their performance. Moreover, because of their focus on specialised tasks, SLMs face additional challenges related to the accessibility of open data in specific fields. For example, in sectors such as medicine, engineering or law, relevant open data is often protected by legal and/or ethical restrictions, making it difficult to use for training language models.

SLMs are trained with carefully selected data aligned to the domain in which they will be used, allowing them to outperform LLMs in accuracy and specificity on specific tasks, for example:

  • Text autocompletion: an SLM for Spanish autocompletion can be trained with a selection of books, educational texts or corpora such as those to be promoted under the aforementioned AI Strategy, making it much more efficient than a general-purpose LLM for this task.
  • Legal consultations: an SLM trained with open legal datasets can provide accurate and contextualised answers to legal questions or process contractual documents more efficiently than an LLM.
  • Customised education: in the education sector, SLMs trained with open educational resources can generate specific explanations, personalised exercises or even automatic assessments, adapted to the level and needs of the student.
  • Medical diagnosis: An SLM trained with medical datasets, such as clinical summaries or open publications, can assist physicians in tasks such as identifying preliminary diagnoses, interpreting medical images through textual descriptions or analysing clinical studies.

Ethical Challenges and Considerations

We should not forget that, despite the benefits, the use of open data in language modelling presents significant challenges. One of the main challenges is, as we have already mentioned, to ensure the quality and neutrality of the data so that they are free of biases, as these can be amplified in the models, perpetuating inequalities or prejudices.

Even if a dataset is technically open, its use in artificial intelligence models always raises some ethical implications. For example, it is necessary to prevent personal or sensitive information from being leaked or deduced from the results generated by the models, as this could harm individuals' privacy.
The issue of data attribution and intellectual property must also be taken into account. The use of open data in business models must address how the original creators of the data are recognised and adequately compensated so that incentives for creators continue to exist.

Open data is the engine that drives the amazing capabilities of language models, both SLMs and LLMs. While SLMs stand out for their efficiency and accessibility, LLMs open doors to advanced applications that not long ago seemed impossible. However, the path towards developing more capable, but also more sustainable and representative, models depends to a large extent on how we manage and exploit open data.


Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization. The contents and points of view reflected in this publication are the sole responsibility of the author.

Blog

In February 2024, the European geospatial community took a major step forward with the first major update of the INSPIRE application schemas in almost a decade. This update, which produces version 5.0 of the schemas, introduces changes that affect the way spatial data are harmonised, transformed and published in Europe. For implementers, policy makers and data users, these changes present both challenges and opportunities.

In this article, we will explain what these changes entail, how they impact on data validation and what steps need to be taken to adapt to this new scenario.

What is INSPIRE and why does it matter?

The INSPIRE Directive (Infrastructure for Spatial Information in Europe) sets out the general rules for the establishment of an Infrastructure for Spatial Information in the European Community based on the Member States' infrastructures. Adopted by the European Parliament and the Council on March 14, 2007 (Directive 2007/2/EC), it is designed to ensure that spatial information is consistent and accessible across EU member countries.

A key element of INSPIRE is the "application schemas". These schemas define how data should be structured to comply with INSPIRE standards, ensuring that data from different countries are compatible with each other. In addition, the schemas make data validation with official tools easier, ensuring their quality and compliance with European standards.

What changes with the 5.0 upgrade?

The transition to version 5.0 brings significant modifications, some of which are not backwards compatible. Among the most notable changes are:

  • Removal of mandatory properties: this simplifies data models, but requires implementers to review their previous configurations and adjust the data to comply with the new rules.
  • Renaming of types and properties: with the update of the INSPIRE schemas to version 5.0, some element names and definitions have changed. This means that data harmonised following the 4.x schemas no longer exactly match the new specifications. To keep these data compliant with current standards, they must be re-transformed using up-to-date tools. This re-transformation ensures that data continues to comply with INSPIRE standards and can be shared and used seamlessly across Europe. The complete table with these updates is as follows:
| Schema | Description of the change | Type of change | Latest version |
| --- | --- | --- | --- |
| ad | Changed the data type for the "building" association of the Address entity type. | Non-disruptive | v4.1 |
| au | Removed the enumeration from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| BaseTypes.xsd | Removed the VerticalPositionValue enumeration from the BaseTypes schema. | Disruptive | v4.0 |
| ef | Added a new attribute "thematicId" to the AbstractMonitoringObject spatial object type. | Non-disruptive | v4.1 |
| el-cov | Changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| ElevationBaseTypes.xsd | Deleted the outline enumeration. | Disruptive | v5.0 |
| el-tin | Changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| el-vec | Removed the enumeration from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| hh | Added new attributes to the EnvHealthDeterminantMeasure type, added new entity types and removed some data types. | Disruptive | v5.0 |
| hy | Updated to version 5.0 as the schema imports the hy-p schema, which was updated to version 5.0. | Disruptive and non-disruptive | v5.0 |
| hy-p | Changed the data type of the geometry attribute of the DrainageBasin type. | Disruptive and non-disruptive | v5.0 |
| lcv | Added an association role to the LandCoverUnit entity type. | Disruptive | v5.0 |
| mu | Changed the encoding of attributes referring to enumerations. | Disruptive | v4.0 |
| nz-core | Removed the enumeration from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| ObservableProperties.xsd | Removed the enumeration from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v4.0 |
| pf | Changed the definition of the ProductionInstallation entity type. | Non-disruptive | v4.1 |
| plu | Fixed a typo in the "backgroudMapURI" attribute of the BackgroundMapValue data type. | Disruptive | v4.0.1 |
| ps | Fixed a typo in inspireId, added a new attribute and moved attributes to a data type. | Disruptive | v5.0 |
| sr | Changed the stereotype of the ShoreSegment object from featureType to datatype. | Disruptive | v4.0.1 |
| su-vector | Added a new attribute StatisticalUnitType to the VectorStatisticalUnit entity type. | Non-disruptive | v4.1 |
| tn | Removed the enumeration from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| tn-a | Changed the data type for the "controlTowers" association of the AerodromeNode entity type. | Non-disruptive | v4.1 |
| tn-ra | Removed enumerations from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| tn-ro | Removed enumerations from the schema and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| tn-w | Removed the abstract stereotype for the TrafficSeparationScheme entity type; removed enumerations from the schema and changed the encoding of attributes referring to enumerations. | Disruptive and non-disruptive | v5.0 |
| us-govserv | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |
| us-net-common | Defined the data type for the authorityRole attribute and changed the encoding of attributes referring to enumerations. | Disruptive | v5.0 |
| us-net-el | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |
| us-net-ogc | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |
| us-net-sw | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |
| us-net-th | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |
| us-net-wa | Updated the version of the imported us-net-common schema (from 4.0 to 5.0). | Disruptive | v5.0 |

Figure 1. Latest INSPIRE updates.

  • Major changes in version 4.0: although normally a major change in a schema would lead to a new major version (e.g. from 4.0 to 5.0), some INSPIRE schemas in version 4.0 have received significant updates without changing version number. A notable example is the Planned Land Use (PLU) schema. These updates mean that projects and services using the PLU schema in version 4.0 must be reviewed and modified to adapt to the new specifications. This is particularly relevant for those working with XPlanung, a standard used in urban and land use planning in some European countries. The changes made to the PLU schema oblige implementers to update their transformation projects and republish data to ensure that they comply with the new INSPIRE rules.

Impact on validation and monitoring

The update affects not only how data is structured, but also how it is validated. The official INSPIRE tools, such as the Validator, have incorporated the new versions of the schemas, which gives rise to different validation scenarios:

  • Data conforming to previous versions: data harmonised to version 4.x can still pass basic validation tests, but may fail specific tests requiring the use of the updated schemas.
  • Specific tests for updated themes: some themes, such as Protected Sites, require data to follow the most recent versions of the schemas to pass all compliance tests.

In addition, the Joint Research Center (JRC) has indicated that these updated versions will be used in official INSPIRE monitoring from 2025 onwards, underlining the importance of adapting as soon as possible.

What does this mean for consumers?

To ensure that data conforms to the latest versions of the schemas and can be used in European systems, it is essential to take concrete steps:

  • If you are publishing new datasets: use the updated versions of the schemas from the beginning.
  • If you are working with existing data: update your datasets to reflect the schema changes. This may involve adjusting feature types and performing new transformations.
  • Publishing services: If your data is already published, you will need to re-transform and republish it to ensure it conforms to the new specifications.

These actions are essential not only to comply with INSPIRE standards, but also to ensure long-term data interoperability.

Conclusion

The update to version 5.0 of the INSPIRE schemas represents a technical challenge, but also an opportunity to improve the interoperability and usability of spatial data in Europe. Adopting these modifications not only ensures regulatory compliance, but also positions implementers as leaders in the modernisation of spatial data infrastructure.

Although the updates may seem complex, they have a clear purpose: to strengthen the interoperability of spatial data in Europe. With better harmonised data and updated tools, it will be easier for governments, businesses and organisations to collaborate and make informed decisions on crucial issues such as sustainability, land management and climate change.

Furthermore, these improvements reinforce INSPIRE's commitment to technological innovation, making European spatial data more accessible, useful and relevant in an increasingly interconnected world.


Content prepared by Mayte Toscano, Senior Consultant in Data Economy Technologies. The contents and points of view reflected in this publication are the sole responsibility of its author.

Blog

Data governance is crucial for the digital transformation of organisations. It is developed through various axes within the organisation and forms an integral part of the organisation's digital transformation plan. In a world where organisations need to constantly reinvent themselves and look for new business models and opportunities to innovate, data governance becomes a key part of moving towards a fairer and more inclusive digital economy, while remaining competitive.

Organisations need to maximise the value of their data, identify new challenges and manage the role of data in the use and development of disruptive technologies such as Artificial Intelligence. Thanks to data governance, it is possible to make informed decisions, improve operational efficiency and ensure regulatory compliance, while ensuring data security and privacy.

To achieve this, it is essential to carry out a planned digital transformation, centred on a strategic data governance plan that complements the organisation's strategic plan. The UNE 0085 guide helps to implement data governance in any organisation and does so by placing special emphasis on the design of the programme through an evaluation cycle based on gap analysis, which must be relevant and decisive for senior management to approve the launch of the programme.

The data governance office, key body of the programme

A data governance programme should identify what data is critical to the organisation, where it resides and how it is used. This must be accompanied by a management system that coordinates the deployment of data governance, management and quality processes. An integrated approach with other management systems that the organisation may have, such as the business continuity management system or the information security system, is necessary.

The Data Governance Office is the area in charge of coordinating the development of the different components of the data governance and management system, i.e. it is the area that participates in the creation of the guidelines, rules and policies that allow the appropriate treatment of data, as well as ensuring compliance with the different regulations.

The Data Governance Office should be a key body of the programme. It serves as a bridge between business areas, coordinating data owners and data stewards at the organisational level.

UNE 0085: guidelines for implementing data governance

Implementing a data governance programme is not an easy task. To help organisations with this challenge, the new UNE 0085  has been developed, which follows a process approach as opposed to an artefact approach and summarises as a guide the steps to follow to implement such a programme, thus complementing the family of UNE standards on data governance, management and quality 0077, 0078, 0079 and 0080.

This guide:

  • It emphasises the importance of the programme being born aligned with the strategic objectives of the organisation, with strong sponsorship.
  • Describes at a high level the key aspects that should be covered by the programme.
  • It details different typical scenarios, which can help an organisation clarify where to start and which initiatives to prioritise, as well as the operating model and roles it will need for deployment.
  • It presents the design of the data governance programme through an evaluation cycle based on gap analysis. The cycle starts with an initial assessment phase (As Is) that shows the organisation's starting situation, followed by a second phase (To Be) in which the scope and objectives of the programme are defined and aligned with the organisation's strategic objectives, before the gap analysis itself is carried out. It ends with a business case that includes deliverables such as scope, frameworks, programme objectives and milestones, budget, roadmap and measurable benefits with associated KPIs, among other aspects. This business case serves as the basis for management to launch the data governance programme and, with it, its implementation throughout the organisation. The different phases of the cycle in relation to the UNE 0077 data governance system are presented below:

Figure 1. Evaluation phases for programme design (UNE 0085) in relation to the data governance system (UNE 0077): As Is → To Be → Gap analysis → Roadmap → Business case. Source: “Specification UNE 0085 Data Governance Implementation Guide”, General Directorate of Data (2024).

Finally, beyond processes and systems, we cannot forget people and the roles they play in this digital transformation. Data controllers and the entities involved are central to this organisational culture change. It is necessary to manage this change effectively in order to deploy a data governance operating model that fits the needs of each organisation.

It may seem complex to orchestrate and define an exercise of this magnitude, especially with abstract concepts related to data governance. This is where the new data governance office, which each organisation must establish, comes into play. This office will assist in these essential tasks, always following the appropriate frameworks and standards.

It is recommended to follow a methodology that facilitates this work, such as the UNE specifications for data governance, management and quality (0077, 0078, 0079 and 0080). These specifications are now complemented by the new UNE 0085, a practical implementation guide that can be downloaded free of charge from the AENOR website.

The content of this guide can be downloaded freely and free of charge from the AENOR portal through the link below by accessing the purchase section. Access to this family of UNE data specifications is sponsored by the Secretary of State for Digitalization and Artificial Intelligence, Directorate General for Data. Although the download requires prior registration, a 100% discount on the total price is applied at the time of finalizing the purchase. After finalizing the purchase, the selected standard or standards can be accessed from the customer area in “my products” section.

Blog

Today's climate crisis and environmental challenges demand innovative and effective responses. In this context, the European Commission's Destination Earth (DestinE) initiative is a pioneering project that aims to develop a highly accurate digital model of our planet.

Through this digital twin of the Earth it will be possible to monitor and prevent potential natural disasters, adapt sustainability strategies and coordinate humanitarian efforts, among other functions. In this post, we analyse what the project consists of and its current state of development.

Features and components of Destination Earth

Aligned with the European Green Deal and the Digital Europe Strategy, Destination Earth integrates digital modelling and climate science to provide a tool that is useful in addressing environmental challenges. To this end, it focuses on accuracy, local detail and speed of access to information.

In general, the tool allows:

  • Monitor and simulate Earth system developments, including land, sea, atmosphere and biosphere, as well as human interventions.
  • Anticipate environmental disasters and socio-economic crises, enabling the safeguarding of lives and the prevention of significant economic downturns.
  • Generate and test scenarios that promote more sustainable development in the future.

To do this, DestinE is subdivided into three main components:

  • Data lake:
    • What is it? A centralised repository that stores data from a variety of sources, such as the European Space Agency (ESA), EUMETSAT and Copernicus, as well as from the new digital twins.
    • What does it provide? This infrastructure enables the discovery of and access to data, as well as the processing of large volumes of information in the cloud.
  • The DestinE Platform:
    • What is it? A digital ecosystem that integrates services, data-driven decision-making tools and an open, flexible and secure cloud computing infrastructure.
    • What does it provide? Users have access to thematic information, models, simulations, forecasts and visualisations that facilitate a deeper understanding of the Earth system.
  • Digital twins:
    • What are they? Digital replicas covering different aspects of the Earth system. The first two are already developed: one on climate change adaptation and the other on extreme weather events.
    • What do they provide? These twins offer multi-decadal simulations (temperature variation) and high-resolution forecasts.

Discover the services and contribute to improve DestinE

The DestinE platform offers a collection of applications and use cases developed within the framework of the initiative, for example:

  • Digital twin of tourism (Beta): allows users to review and anticipate the viability of tourism activities according to the environmental and meteorological conditions of their territory.
  • VizLab: offers an intuitive graphical user interface and advanced 3D rendering technologies to provide a storytelling experience, making complex datasets accessible and understandable to a wide audience.
  • miniDEA: is an interactive and easy-to-use DEA-based web visualisation app for previewing DestinE data.
  • GeoAI: is a geospatial AI platform for Earth observation use cases.
  • Global Fish Tracking System (GFTS): a project to help obtain accurate information on fish stocks in order to develop evidence-based conservation policies.
  • More resilient urban planning: a solution that provides a heat stress index, allowing urban planners to understand best practices for adapting to extreme temperatures in urban environments.
  • Danube Delta Water Reserve Monitoring: is a comprehensive and accurate analysis based on the DestinE data lake to inform conservation efforts in the Danube Delta, one of the most biodiverse regions in Europe.

Since October 2024, the DestinE platform has been accepting registrations, which allows users to explore the full potential of the tool and access exclusive resources. Registration also serves to collect feedback and improve the project.

To become a user and be able to generate services, you must follow these steps.

Project roadmap:

The European Union sets out a series of time-bound milestones that will mark the development of the initiative:

  • 2022 - Official launch of the project.
  • 2023 - Start of development of the main components.
  • 2024 - Development of all system components. Implementation of the DestinE platform and data lake. Demonstration.
  • 2026 - Enhancement of the DestinE system, integration of additional digital twins and related services.
  • 2030 - Full digital replica of the Earth.

Destination Earth not only represents a technological breakthrough, but is also a powerful tool for sustainability and resilience in the face of climate challenges. By providing accurate and accessible data, DestinE enables data-driven decision-making and the creation of effective adaptation and mitigation strategies.

Blog

There is no doubt that data has become a strategic asset for organisations. Today, it is essential to ensure that decisions are based on quality data, regardless of the approach they follow: data analytics, artificial intelligence or reporting. However, ensuring data repositories with high levels of quality is not an easy task, given that in many cases data come from heterogeneous sources where data quality principles have not been taken into account and no context about the domain is available.

To alleviate this situation as far as possible, in this article we will explore one of the most widely used libraries in data analysis: Pandas. We will see how this Python library can be an effective tool for improving data quality. We will also review the relationship of some of its functions with the data quality dimensions and properties included in the UNE 0081 data quality specification, along with some concrete examples of its application in data repositories with the aim of improving data quality.

Using Pandas for data profiling

Although data profiling and data quality assessment are closely related, their approaches are different:

  • Data Profiling: is the process of exploratory analysis performed to understand the fundamental characteristics of the data, such as its structure, data types, distribution of values, and the presence of missing or duplicate values. The aim is to get a clear picture of what the data looks like, without necessarily making judgements about its quality.
  • Data quality assessment: involves the application of predefined rules and standards to determine whether data meets certain quality requirements, such as accuracy, completeness, consistency, credibility or timeliness. In this process, errors are identified and actions to correct them are determined. A useful guide for data quality assessment is the UNE 0081 specification.

Data profiling consists of exploring and analysing a dataset to gain a basic understanding of its structure, content and characteristics, before conducting a more in-depth analysis or assessment of the quality of the data. The main objective is to obtain an overview of the data by analysing the distribution, types of data, missing values, relationships between columns and detection of possible anomalies. Pandas has several functions to perform this data profiling.

In short, data profiling is an initial exploratory step that helps prepare the ground for a more in-depth data quality assessment, providing essential information to identify problem areas and define the appropriate quality rules for the subsequent evaluation.

What is Pandas and how does it help ensure data quality?

Pandas is one of the most popular Python libraries for data manipulation and analysis. Its ability to handle large volumes of structured information makes it a powerful tool in detecting and correcting errors in data repositories. With Pandas, complex operations can be performed efficiently, from data cleansing to data validation, all of which are essential to maintain quality standards. The following are some examples of how to improve data quality in repositories with Pandas:

1. Detection of missing or inconsistent values: One of the most common data errors is missing or inconsistent values. Pandas allows these values to be easily identified by functions such as isnull() or dropna(). This is key for the completeness property of the records and the data consistency dimension, as missing values in critical fields can distort the results of the analyses.

# Identify null values in a dataframe

df.isnull().sum()
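
Since dropna() is also mentioned above, here is a minimal, hedged sketch of how missing values could then be handled once identified; the column names 'id' and 'age' are hypothetical:

# Drop rows where a critical field such as 'id' is missing
df = df.dropna(subset=['id'])

# Fill missing values in a numeric column with its median instead of leaving them null
df['age'] = df['age'].fillna(df['age'].median())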

2. Data standardisation and normalisation: Errors in naming or coding consistency are common in large repositories. For example, in a dataset containing product codes, some may be misspelled or may not follow a standard convention. Pandas provides functions like merge() to perform a comparison with a reference database and correct these values. This option is key to maintaining the dimension and semantic consistency property of the data.

# Substitution of incorrect values using a reference table

df = df.merge(product_codes, left_on='product_code', right_on='ref_code', how='left')

3. Validation of data requirements: Pandas allows the creation of customised rules to validate the compliance of data with certain standards. For example, if an age field should only contain positive integer values, we can apply a function to identify and correct values that do not comply with this rule. In this way, any business rule of any of the data quality dimensions and properties can be validated.

# Identify records with invalid age values (negative or decimals)

age_errors = df[(df['age'] < 0) | (df['age'] % 1 != 0)]

4. Exploratory analysis to identify anomalous patterns: Functions such as describe() or groupby() in Pandas allow you to explore the general behaviour of your data. This type of analysis is essential for detecting anomalous or out-of-range patterns in any data set, such as unusually high or low values in columns that should follow certain ranges.

# Statistical summary of the data

df.describe()

# Group by a category or property to inspect aggregate behaviour (the column name below is illustrative)

df.groupby('category').describe()

5. Duplication removal: Duplicate data is a common problem in data repositories. Pandas provides methods such as drop_duplicates() to identify and remove these records, ensuring that there is no redundancy in the dataset. This capacity would be related to the dimension of completeness and consistency.

# Remove duplicate rows

df = df.drop_duplicates()

Practical example of the application of Pandas

Having presented the above functions that help us to improve the quality of data repositories, we now consider a case to put the process into practice. Suppose we are managing a repository of citizens' data and we want to ensure:

  1. Age data should not contain invalid values (such as negatives or decimals).
  2. That nationality codes are standardised.
  3. That the unique identifiers follow a correct format.
  4. The place of residence must be consistent.

With Pandas, we could perform the following actions:

1. Age validation without incorrect values:

# Identify records with ages outside the allowed ranges (e.g. less than 0 or non-integers)

age_errors = df[(df['age'] < 0) | (df['age'] % 1 != 0)]

2. Correction of nationality codes:

# Use of an official dataset of nationality codes to correct incorrect entries

df_corregida = df.merge(nacionalidades_ref, left_on='nacionalidad', right_on='codigo_ref', how='left')

3. Validation of unique identifiers:

# Check if the format of the identification number follows a correct pattern

df['valid_id'] = df['identificacion'].str.match(r'^[A-Z0-9]{8}$')

errores_id = df[df['valid_id'] == False]

4. Verification of consistency in place of residence:

# Detect possible inconsistencies in residency (e.g. the same citizen residing in two places at the same time).

duplicados_residencia = df.groupby(['id_ciudadano', 'fecha_residencia'])['lugar_residencia'].nunique()

inconsistencias_residencia = duplicados_residencia[duplicados_residencia > 1]

Integration with a variety of technologies

Pandas is an extremely flexible and versatile library that integrates easily with many technologies and tools in the data ecosystem. Some of the main technologies with which Pandas is integrated or can be used are:

  1. SQL databases:

Pandas integrates very well with relational databases such as MySQL, PostgreSQL, SQLite, and others that use SQL. The SQLAlchemy library or directly the database-specific libraries (such as psycopg2 for PostgreSQL or sqlite3) allow you to connect Pandas to these databases, perform queries and read/write data between the database and Pandas.

  • Common function: pd.read_sql() to read a SQL query into a DataFrame, and to_sql() to export the data from Pandas to a SQL table.
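
As a minimal sketch of this pattern, assuming a local SQLite file and table names that are purely illustrative:

import sqlite3
import pandas as pd

# Open a connection to a local SQLite database
conn = sqlite3.connect('citizens.db')

# Read the result of a SQL query into a DataFrame
df = pd.read_sql('SELECT id, age, nationality FROM citizens', conn)

# Write a cleaned DataFrame back to a new table
df.to_sql('citizens_clean', conn, if_exists='replace', index=False)

conn.close()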
  1. REST and HTTP-based APIs:

Pandas can be used to process data obtained from APIs using HTTP requests. Libraries such as requests allow you to get data from APIs and then transform that data into Pandas DataFrames for analysis.
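
A hedged sketch of this flow, using a placeholder URL and assuming the API returns a JSON list of records:

import requests
import pandas as pd

# Request JSON data from a (hypothetical) open data API
response = requests.get('https://example.org/api/datasets', timeout=10)
response.raise_for_status()

# Convert the list of JSON records into a DataFrame for analysis
df = pd.DataFrame(response.json())
print(df.head())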

  1. Big Data (Apache Spark):

Pandas can be used in combination with PySpark, an API for Apache Spark in Python. Although Pandas is primarily designed to work with in-memory data, Koalas, a library based on Pandas and Spark, allows you to work with Spark distributed structures using a Pandas-like interface. Tools like Koalas help Pandas users scale their scripts to distributed data environments without having to learn all the PySpark syntax.

  1. Hadoop and HDFS:

Pandas can be used in conjunction with Hadoop technologies, especially the HDFS distributed file system. Although Pandas is not designed to handle large volumes of distributed data, it can be used in conjunction with libraries such as pyarrow or dask to read or write data to and from HDFS on distributed systems. For example, pyarrow can be used to read or write Parquet files in HDFS.

  1. Popular file formats:

Pandas is commonly used to read and write data in different file formats, such as:

  • CSV: pd.read_csv()
  • Excel: pd.read_excel() and to_excel().
  • JSON: pd.read_json()
  • Parquet: pd.read_parquet() for working with space and time efficient files.
  • Feather: a fast file format for interchange between languages such as Python and R (pd.read_feather()).
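
As a brief illustration of these readers and writers (the file names are hypothetical, and the Parquet functions require pyarrow or fastparquet to be installed):

import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('citizens.csv')

# Write it out in a more space- and time-efficient columnar format
df.to_parquet('citizens.parquet')

# Read it back, preserving column types
df = pd.read_parquet('citizens.parquet')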
  1. Data visualisation tools:

Pandas can be easily integrated with visualisation tools such as Matplotlib, Seaborn and Plotly. These libraries allow you to generate graphs directly from Pandas DataFrames.

  • Pandas includes its own lightweight integration with Matplotlib to generate fast plots using df.plot().
  • For more sophisticated visualisations, it is common to use Pandas together with Seaborn or Plotly for interactive graphics.
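
A minimal sketch of the built-in plotting integration, using an illustrative column:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'age': [23, 35, 41, 29, 52, 47]})

# Quick histogram of a numeric column using the Matplotlib backend
df['age'].plot(kind='hist', title='Age distribution')
plt.show()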
  1.  Machine learning libraries:

Pandas is widely used in pre-processing data before applying machine learning models. Some popular libraries with which Pandas integrates are:

  • Scikit-learn: most machine learning pipelines start with data preparation in Pandas before passing the data to Scikit-learn models.
  • TensorFlow and PyTorch: although these frameworks are more oriented towards handling numerical arrays (NumPy), Pandas is frequently used for loading and cleaning data before training deep learning models.
  • XGBoost, LightGBM, CatBoost: Pandas supports these high-performance machine learning libraries, where DataFrames are used as input to train models.
  1. Jupyter Notebooks:

Pandas is central to interactive data analysis within Jupyter Notebooks, which allow you to run Python code and visualise the results immediately, making it easy to explore data and visualise it in conjunction with other tools.

  1. Cloud Storage (AWS, GCP, Azure):

Pandas can be used to read and write data directly from cloud storage services such as Amazon S3, Google Cloud Storage and Azure Blob Storage. Additional libraries such as boto3 (for AWS S3) or google-cloud-storage facilitate integration with these services. Below is an example for reading data from Amazon S3.

import pandas as pd

import boto3

# Create an S3 client

s3 = boto3.client('s3')

# Get an object from the bucket

obj = s3.get_object(Bucket='mi-bucket', Key='datos.csv')

# Read the CSV file into a DataFrame

df = pd.read_csv(obj['Body'])

 

10. Docker and containers:

 

Pandas can be used in container environments using Docker. Containers are widely used to create isolated environments that ensure the replicability of data analysis pipelines.

In conclusion, the use of Pandas is an effective solution to improve data quality in complex and heterogeneous repositories. Through clean-up, normalisation, business rule validation and exploratory analysis functions, Pandas facilitates the detection and correction of common errors, such as null, duplicate or inconsistent values. In addition, its integration with various technologies, databases, big data environments and cloud storage makes Pandas an extremely versatile tool for ensuring data accuracy, consistency and completeness.


Content prepared by Dr. Fernando Gualo, Professor at UCLM and Data Governance and Quality Consultant. The content and point of view reflected in this publication is the sole responsibility of its author.

Blog

Natural language processing (NLP) is a branch of artificial intelligence that allows machines to understand and manipulate human language. At the core of many modern applications, such as virtual assistants, machine translation and chatbots, are word embeddings. But what exactly are they and why are they so important?

What are word embeddings?

Word embeddings are a technique that allows machines to represent the meaning of words in such a way that complex relationships between words can be captured. To understand this, let's think about how words are used in a given context: a word acquires meaning depending on the words surrounding it. For example, the word bank can refer to a financial institution or to the bank of a river, depending on the context in which it is found.

The idea behind word embeddings is that each word is assigned a vector in a multi-dimensional space. The position of these vectors in space reflects the semantic closeness between the words. If two words have similar meanings, their vectors will be close. If their meanings are opposite or unrelated, they are distant in vector space.

To visualise this, imagine that words like lake, river and ocean would be close together in this space, while words like lake and building would be much further apart. This structure enables language processing algorithms to perform complex tasks, such as finding synonyms, making accurate translations or even answering context-based questions.

How are word embeddings created?

The main objective of word embeddings is to capture semantic relationships and contextual information of words, transforming them into numerical representations that can be understood by machine learning algorithms. Instead of working with raw text, machines require words to be converted into numbers in order to identify patterns and relationships effectively.

The process of creating word embeddings consists of training a model on a large corpus of text, such as Wikipedia articles or news items, so that it learns the structure of the language. The first step involves a series of pre-processing operations on the corpus, including tokenising the words, removing punctuation and irrelevant terms and, in some cases, converting the entire text to lower case to maintain consistency.
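
A minimal sketch of this kind of pre-processing, assuming a simple regular-expression tokeniser rather than a production-grade one:

import re

def preprocess(text):
    # Convert to lower case for consistency
    text = text.lower()
    # Remove punctuation, keeping only word characters and spaces
    text = re.sub(r'[^\w\s]', ' ', text)
    # Tokenise by splitting on whitespace
    return text.split()

print(preprocess("The plane takes off at six o'clock."))
# ['the', 'plane', 'takes', 'off', 'at', 'six', 'o', 'clock']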

The use of context to capture meaning

Once the text has been pre-processed, a technique known as a "sliding context window" is used to extract information. This means that, for each target word, the surrounding words within a certain range are taken into account. For example, if the context window is 3 words, for the word plane in the sentence "The plane takes off at six o'clock", the context words will be The, takes, off and at.

The model is trained to learn to predict a target word using the words in its context (or conversely, to predict the context from the target word). To do this, the algorithm adjusts its parameters so that the vectors assigned to each word are closer in vector space if those words appear frequently in similar contexts.
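
A hedged sketch of how such (target, context) pairs could be extracted with a window of three words on each side:

def context_windows(tokens, window=3):
    # Yield (target, context) pairs using a sliding window
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        yield target, context

tokens = ['the', 'plane', 'takes', 'off', 'at', 'six']
for target, context in context_windows(tokens):
    print(target, context)
# e.g. plane ['the', 'takes', 'off', 'at']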

How models learn language structure

The creation of word embeddings is based on the ability of these models to identify patterns and semantic relationships. During training, the model adjusts the values of the vectors so that words that often share contexts have similar representations. For example, if airplane and helicopter are frequently used in similar phrases (e.g. in the context of air transport), the vectors of airplane and helicopter will be close together in vector space.

As the model processes more and more examples of sentences, it refines the positions of the vectors in the continuous space. Thus, the vectors reflect not only semantic proximity, but also other relationships such as synonyms, categories (e.g., fruits, animals) and hierarchical relationships (e.g., dog and animal).

A simplified example

Imagine a small corpus of only six words: guitar, bass, drums, piano, car and bicycle. Suppose that each word is represented in a three-dimensional vector space as follows:

guitar     [0.3, 0.8, -0.1]

bass       [0.4, 0.7, -0.2]

drums      [0.2, 0.9, -0.1]

piano      [0.1, 0.6, -0.3]

car        [0.8, -0.1, 0.6]

bicycle    [0.7, -0.2, 0.5]

In this simplified example, the words guitar, bass, drums and piano represent musical instruments and are located close to each other in vector space, as they are used in similar contexts. In contrast, car and bicycle, which belong to the category of means of transport, are distant from musical instruments but close to each other. This other image shows how different terms related to sky, wings and engineering would look like in a vector space. 

Figure 1. Examples of representation of a corpus in a vector space. The image plots words on three axes (sky, wings and engineering): bee, eagle and goose lie between wings and sky; rocket, drone and helicopter between sky and engineering; and jet spans all three axes. Source: Adapted from “Word embeddings: the (very) basics”, by Guillaume Desagulier.

This example only uses three dimensions to illustrate the idea, but in practice, word embeddings usually have between 100 and 300 dimensions to capture more complex semantic relationships and linguistic nuances.
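
To make the intuition concrete, here is a small sketch that computes the cosine similarity between some of the toy vectors above; the numbers are the illustrative values from the example, not real embeddings:

import numpy as np

vectors = {
    'guitar':  np.array([0.3, 0.8, -0.1]),
    'piano':   np.array([0.1, 0.6, -0.3]),
    'car':     np.array([0.8, -0.1, 0.6]),
    'bicycle': np.array([0.7, -0.2, 0.5]),
}

def cosine(a, b):
    # Cosine similarity: values close to 1 mean the vectors point in similar directions
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors['guitar'], vectors['piano']))   # high: both are instruments
print(cosine(vectors['guitar'], vectors['car']))     # low: different domains
print(cosine(vectors['car'], vectors['bicycle']))    # high: both are means of transport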

The result is a set of vectors that efficiently represent each word, allowing language processing models to identify patterns and semantic relationships more accurately. With these vectors, machines can perform advanced tasks such as semantic search, text classification and question answering, significantly improving natural language understanding.

Strategies for generating word embeddings

Over the years, multiple approaches and techniques have been developed to generate word embeddings. Each strategy has its own way of capturing the meaning and semantic relationships of words, resulting in different characteristics and uses. Some of the main strategies are presented below:

1. Word2Vec: local context capture

Developed by Google, Word2Vec is one of the most popular approaches and is based on the idea that the meaning of a word is defined by its context. It uses two main approaches:

  • CBOW (Continuous Bag of Words): In this approach, the model predicts the target word using the words in its immediate environment. For example, given a context such as "The dog is ___ in the garden", the model attempts to predict the word playing, based on the words The, dog, is and garden.
  • Skip-gram: Conversely, Skip-gram uses a target word to predict the surrounding words. Using the same example, if the target word is playing, the model would try to predict that the words in its environment are The, dog, is and garden.

The key idea is that Word2Vec trains the model to capture semantic proximity across many iterations on a large corpus of text. Words that tend to appear together have closer vectors, while unrelated words appear further apart.
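
A minimal sketch of training such vectors with the gensim library; the tiny corpus is purely illustrative, so the resulting vectors would not be meaningful:

from gensim.models import Word2Vec

# Each sentence is a list of pre-processed tokens
sentences = [
    ['the', 'dog', 'is', 'playing', 'in', 'the', 'garden'],
    ['the', 'dog', 'runs', 'in', 'the', 'park'],
    ['the', 'plane', 'takes', 'off', 'at', 'six'],
]

# sg=1 selects the Skip-gram approach; sg=0 would use CBOW
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Vector for a word and its nearest neighbours in this toy corpus
print(model.wv['dog'])
print(model.wv.most_similar('dog'))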

2. GloVe: global statistics-based approach

GloVe, developed at Stanford University, differs from Word2Vec by using global co-occurrence statistics of words in a corpus. Instead of considering only the immediate context, GloVe is based on the frequency with which two words appear together in the whole corpus.

For example, if bread and butter appear together frequently, but bread and planet are rarely found in the same context, the model adjusts the vectors so that bread and butter are close together in vector space.

This allows GloVe to capture broader global relationships between words and to make the representations more robust at the semantic level. Models trained with GloVe tend to perform well on analogy and word similarity tasks.
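
A simple sketch of the kind of global co-occurrence counts GloVe starts from; for brevity, pairs are counted within whole sentences rather than a fixed window, and the real model then factorises these statistics into vectors:

from collections import Counter
from itertools import combinations

corpus = [
    ['bread', 'and', 'butter', 'for', 'breakfast'],
    ['bread', 'with', 'butter', 'and', 'jam'],
    ['the', 'planet', 'orbits', 'the', 'sun'],
]

# Count how often each unordered pair of words appears in the same sentence
cooccurrence = Counter()
for sentence in corpus:
    for w1, w2 in combinations(set(sentence), 2):
        cooccurrence[tuple(sorted((w1, w2)))] += 1

print(cooccurrence[('bread', 'butter')])   # bread and butter co-occur often
print(cooccurrence[('bread', 'planet')])   # bread and planet never co-occur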

3. FastText: subword capture

FastText, developed by Facebook, improves on Word2Vec by introducing the idea of breaking words down into sub-words. Instead of treating each word as an indivisible unit, FastText represents each word as a sum of character n-grams. For example, the word playing could be broken down into n-grams such as pla, lay, ayi, yin and ing.

This allows FastText to capture similarities even between words that did not appear explicitly in the training corpus, such as morphological variations (playing, play, player). This is particularly useful for languages with many grammatical variations.
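
A small sketch of the character n-gram decomposition that FastText relies on, here with trigrams only and the boundary markers '<' and '>' that FastText adds around each word:

def char_ngrams(word, n=3):
    # Add boundary markers and slide a window of length n over the word
    marked = '<' + word + '>'
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams('playing'))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']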

4. Contextual embeddings: dynamic sense-making

Models such as BERT and ELMo represent a significant advance in word embeddings. Unlike the previous strategies, which generate a single vector for each word regardless of the context, contextual embeddings generate different vectors for the same word depending on its use in the sentence.

For example, the word bank will have a different vector in the sentence "I sat on the bank of the river" than in "the bank approved my credit application". This variability is achieved by training the model on large text corpora in a bidirectional manner, i.e. considering not only the words preceding the target word, but also those following it.
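
As a hedged sketch of this behaviour using the Hugging Face transformers library (the model choice and the sentences are illustrative, and running the code downloads a pretrained BERT model):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def embedding_of(word, sentence):
    # Encode the sentence and locate the position of the target word
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    idx = tokens.index(word)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0, idx]

v1 = embedding_of('bank', 'i sat on the bank of the river')
v2 = embedding_of('bank', 'the bank approved my credit application')

# The two vectors differ because the surrounding context differs
print(torch.cosine_similarity(v1, v2, dim=0))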

Practical applications of word embeddings

Word embeddings are used in a variety of natural language processing applications, including:

  • Named Entity Recognition (NER): allows you to identify and classify names of people, organisations and places in a text. For example, in the sentence "Apple announced its new headquarters in Cupertino", the word embeddings allow the model to understand that Apple is an organisation and Cupertino is a place.
  • Automatic translation: helps to represent words in a language-independent way. By training a model with texts in different languages, representations can be generated that capture the underlying meaning of words, facilitating the translation of complete sentences with a higher level of semantic accuracy.
  • Information retrieval systems: in search engines and recommender systems, word embeddings improve the match between user queries and relevant documents. By capturing semantic similarities, they allow even non-exact queries to be matched with useful results. For example, if a user searches for "medicine for headache", the system can suggest results related to analgesics thanks to the similarities captured in the vectors.
  • Q&A systems:  word embeddings are essential in systems such as chatbots and virtual assistants, where they help to understand the intent behind questions and find relevant answers. For example, for the question "What is the capital of Italy?", the word embeddings allow the system to understand the relationship between capital and Italy and find Rome as an answer.
  • Sentiment analysis:  word embeddings are used in models that determine whether the sentiment expressed in a text is positive, negative or neutral. By analysing the relationships between words in different contexts, the model can identify patterns of use that indicate certain feelings, such as joy, sadness or anger.
  • Semantic clustering and similarity detection:  word embeddings also allow you to measure the semantic similarity between documents, phrases or words. This is used for tasks such as grouping related items, recommending products based on text descriptions or even detecting duplicates and similar content in large databases.

Conclusion

Word embeddings have transformed the field of natural language processing by providing dense and meaningful representations of words, capable of capturing their semantic and contextual relationships. With the emergence of contextual embeddings , the potential of these representations continues to grow, allowing machines to understand even the subtleties and ambiguities of human language. From applications in translation and search systems, to chatbots and sentiment analysis, word embeddings will continue to be a fundamental tool for the development of increasingly advanced and humanised natural language technologies.


Content prepared by Juan Benavente, senior industrial engineer and expert in technologies linked to the data economy. The contents and points of view reflected in this publication are the sole responsibility of the author.

Blog

A digital twin is a virtual, interactive representation of a real-world object, system or process. We are talking, for example, about a digital replica of a factory, a city or even a human body. These virtual models allow simulating, analysing and predicting the behaviour of the original element, which is key for optimisation and maintenance in real time.

Due to their functionalities, digital twins are being used in various sectors such as health, transport or agriculture. In this article, we review the benefits of their use and show two examples related to open data.

Advantages of digital twins

Digital twins use real data sources from the environment, obtained through sensors and open platforms, among others. As a result, the digital twins are updated in real time to reflect reality, which brings a number of advantages:

  • Increased performance: one of the main differences with traditional simulations is that digital twins use real-time data for modelling, allowing better decisions to be made to optimise equipment and system performance according to the needs of the moment.
  • Improved planning: using technologies based on artificial intelligence (AI) and machine learning, the digital twin can analyse performance issues or run virtual "what-if" simulations. In this way, failures and problems can be predicted before they occur, enabling proactive maintenance (a minimal illustrative sketch of this idea follows the list).
  • Cost reduction: improved data management thanks to a digital twin is estimated to generate benefits equivalent to 25% of total infrastructure expenditure. In addition, by avoiding costly failures and optimising processes, operating costs can be significantly reduced. Digital twins also enable remote monitoring and control of systems from anywhere, improving efficiency by centralising operations.
  • Customisation and flexibility: by creating detailed virtual models of products or processes, organisations can quickly adapt their operations to the changing demands of their environment and to individual customer or citizen preferences. For example, in manufacturing, digital twins enable customised mass production, adjusting production lines in real time to create unique products according to customer specifications. In healthcare, digital twins can model the human body to personalise medical treatments, improving efficacy and reducing side effects.
  • Boosting experimentation and innovation: digital twins provide a safe and controlled environment for testing new ideas and solutions, without the risks and costs associated with physical experiments. Among other things, they allow experimentation with large objects or projects that, due to their size, do not usually lend themselves to real-life experimentation.
  • Improved sustainability: by enabling simulation and detailed analysis of processes and systems, organisations can identify areas of inefficiency and waste, thus optimising the use of resources. For example, digital twins can model energy consumption and production in real time, enabling precise adjustments that reduce consumption and carbon emissions.
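
To make the idea more concrete, here is a minimal Python sketch of the loop behind several of these advantages: a virtual replica kept in sync with incoming sensor readings that can run a simple "what-if" simulation and flag proactive maintenance. All names, thresholds and the linear temperature model are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class PumpTwin:
    """Toy digital twin of a pump: mirrors sensor state and runs what-if checks."""
    temperature_c: float = 40.0
    max_safe_temp_c: float = 80.0              # hypothetical safety threshold
    history: list = field(default_factory=list)

    def update(self, sensor_reading_c: float) -> None:
        """Keep the virtual replica in sync with the latest real-time reading."""
        self.temperature_c = sensor_reading_c
        self.history.append(sensor_reading_c)

    def what_if(self, extra_load_pct: float) -> bool:
        """Simulate extra load; toy assumption: temperature rises linearly with load."""
        projected = self.temperature_c * (1 + extra_load_pct / 100)
        return projected <= self.max_safe_temp_c

    def needs_maintenance(self) -> bool:
        """Flag proactive maintenance if recent readings approach the safety limit."""
        recent = self.history[-5:]
        return bool(recent) and mean(recent) > 0.9 * self.max_safe_temp_c

twin = PumpTwin()
for reading in [41.2, 55.8, 63.1, 70.4, 74.9]:   # simulated real-time sensor feed
    twin.update(reading)

print(twin.what_if(extra_load_pct=20))   # is it safe to add 20% more load?
print(twin.needs_maintenance())          # should maintenance be scheduled?
```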

Examples of digital twins in Spain

The following two examples illustrate these advantages.

GeDIA project: artificial intelligence to predict changes in territories

GeDIA is a tool for strategic planning of smart cities, which allows scenario simulations. It uses artificial intelligence models based on existing data sources and tools in the territory.

The scope of the tool is very broad, but its creators highlight two use cases:

  1. Future infrastructure needs: the platform performs detailed analyses considering trends, thanks to artificial intelligence models. In this way, growth projections can be made and the needs for infrastructures and services, such as energy and water, can be planned in specific areas of a territory, guaranteeing their availability.
  2. Growth and tourism: GeDIA is also used to study and analyse urban and tourism growth in specific areas. The tool identifies patterns of gentrification and assesses their impact on the local population, using census data. In this way, demographic changes and their impact, such as housing needs, can be better understood and decisions can be made to facilitate equitable and sustainable growth.

This initiative has the participation of various companies and the University of Malaga (UMA), as well as the financial backing of Red.es and the European Union.

Digital twin of the Mar Menor: data to protect the environment

The Mar Menor, a coastal salt lagoon in the Region of Murcia, has suffered serious ecological problems in recent years, influenced by agricultural pressure, tourism and urbanisation.

To better understand the causes and assess possible solutions, TRAGSATEC, a Spanish state-owned engineering and environmental services company, developed a digital twin. It mapped a surrounding area of more than 1,600 square kilometres, known as the Campo de Cartagena Region. In total, 51,000 nadir images, 200,000 oblique images and more than four terabytes of LiDAR data were obtained.

Thanks to this digital twin, TRAGSATEC has been able to simulate various flooding scenarios and the impact of installing containment elements or obstacles, such as a wall, to redirect the flow of water. They have also been able to study the distance between the soil and the groundwater, to determine the impact of fertiliser seepage, among other issues.

Challenges and the way forward

These are just two examples, but they highlight the potential of an increasingly popular technology. However, for adoption to become even more widespread, some challenges need to be addressed, such as the initial costs of technology and training, or security concerns, since connecting physical systems increases the attack surface. Another challenge is the interoperability problems that arise when different public administrations establish digital twins and local data spaces. To help address this issue, the European Commission has published a guide that identifies the main organisational and cultural challenges to interoperability and offers good practices to overcome them.

In short, digital twins offer numerous advantages, such as improved performance or cost reduction. These benefits are driving their adoption in various industries and it is likely that, as current challenges are overcome, digital twins will become an essential tool for optimising processes and improving operational efficiency in an increasingly digitised world.

Blog

Almost half of European adults lack basic digital skills. According to the latest State of the Digital Decade report, in 2023, only 55.6% of citizens reported having such skills. This percentage rises to 66.2% in the case of Spain, ahead of the European average.

Having basic digital skills is essential in today's society because it enables access to a wider range of information and services, as well as effective communication in online environments, facilitating greater participation in civic and social activities. It is also a great competitive advantage in the world of work.

In Europe, more than 90% of professional roles require a basic level of digital skills. Technological knowledge has long since ceased to be required only for technical professions, but is spreading to all sectors, from business to transport and even agriculture. In this respect, more than 70% of companies said that the lack of staff with the right digital skills is a barrier to investment.

A key objective of the Digital Decade is therefore to ensure that at least 80% of people aged 16-74 have at least basic digital skills by 2030.

Basic technology skills that everyone should have

When we talk about basic technological capabilities, we refer, according to the DigComp framework, to a number of areas, including:

  • Information and data literacy: includes locating, retrieving, managing and organising data, and judging the relevance of the source and its content.
  • Communication and collaboration: involves interacting, communicating and collaborating through digital technologies, taking into account cultural and generational diversity. It also includes managing one's own digital presence, identity and reputation.
  • Digital content creation: covers the creation, enhancement and integration of information and content to generate new messages, respecting copyright and licences. It also involves knowing how to give understandable instructions to a computer system.
  • Security: covers the protection of devices, content, personal data and privacy in digital environments, as well as the protection of physical and mental health.
  • Problem solving: involves identifying and resolving needs and problems in digital environments. It also focuses on using digital tools to innovate processes and products and to keep up with digital evolution.

Which data-related jobs are most in demand?

Now that the core competences are clear, it is worth noting that, in a world where digitalisation is becoming increasingly important, it is not surprising that the demand for advanced technological and data-related skills is also growing.

According to data from the LinkedIn employment platform, among the 25 fastest growing professions in Spain in 2024 are security analysts (position 1), software development analysts (2), data engineers (11) and artificial intelligence engineers (25). Similar data is offered by Fundación Telefónica's Employment Map, which also highlights four of the most in-demand profiles related to data:

  • Data analyst: responsible for managing and exploiting information; they collect, analyse and interpret data, often by creating dashboards and reports.
  • Database designer or database administrator: focused on designing, implementing and managing databases, as well as maintaining their security by implementing backup and recovery procedures in case of failure.
  • Data engineer: responsible for the design and implementation of data architectures and infrastructures to capture, store, process and access data, optimising their performance and guaranteeing their security.
  • Data scientist: focused on data analysis and predictive modelling, the optimisation of algorithms and the communication of results.

These are all jobs with good salaries and future prospects, but where there is still a large gap between men and women. According to European data, only 1 in 6 ICT specialists and 1 in 3 science, technology, engineering and mathematics (STEM) graduates are women.

To work in data-related professions, you need, among other skills, knowledge of popular programming languages such as Python, R or SQL, as well as a range of data processing and visualisation tools, some of which have been covered in previous articles.
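
As a small illustration of the kind of routine task these profiles perform (the file name and column names below are hypothetical), a data analyst might use Python and pandas to clean and summarise an open dataset before building a dashboard:

```python
import pandas as pd

# Hypothetical open dataset with columns: "province", "year", "air_quality_index"
df = pd.read_csv("open_data_air_quality.csv")

# Basic cleaning: drop rows with missing values in the column of interest
df = df.dropna(subset=["air_quality_index"])

# Typical analyst task: average indicator per province and year, ready for reporting
summary = (
    df.groupby(["province", "year"])["air_quality_index"]
      .mean()
      .reset_index()
      .sort_values("air_quality_index")
)
print(summary.head())
```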

The range of training courses on all these skills is growing all the time.

Future prospects

Nearly a quarter of all jobs (23%) will change in the next five years, according to the World Economic Forum's Future of Jobs 2023 Report. Technological advances will create new jobs, transform existing ones and destroy those that become obsolete. Technical knowledge, related to areas such as artificial intelligence or Big Data, and the development of cognitive skills, such as analytical thinking, will provide great competitive advantages in the labour market of the future. In this context, policy initiatives to boost society's re-skilling, such as the European Digital Education Action Plan (2021-2027), will help to generate common frameworks and certificates in a constantly evolving world.

The technological revolution is here to stay and will continue to change our world. Therefore, those who start acquiring new skills earlier will be better positioned in the future employment landscape.

Blog

Citizen science is consolidating itself as one of the most relevant sources of reference in contemporary research. This is recognised by the Consejo Superior de Investigaciones Científicas (CSIC), the Spanish National Research Council, which defines citizen science as a methodology and a means for the promotion of scientific culture in which science and citizen participation strategies converge.

We talked some time ago about the importance of citizen science in society. Today, citizen science projects have not only increased in number, diversity and complexity, but have also driven a significant process of reflection on how citizens can actively contribute to the generation of data and knowledge.

To reach this point, programmes such as Horizon 2020, which explicitly recognised citizen participation in science, have played a key role. More specifically, the chapter "Science with and for Society" gave an important boost to this type of initiative in Europe and also in Spain. In fact, as a result of Spanish participation in this programme, as well as in parallel initiatives, Spanish projects have been growing in size and in their connections with international initiatives.

This growing interest in citizen science also translates into concrete policies. An example of this is the current Spanish Strategy for Science, Technology and Innovation (EECTI) for the period 2021-2027, which includes "the social and economic responsibility of R&D&I through the incorporation of citizen science".

As we noted some time ago, citizen science initiatives seek to encourage a more democratic science that responds to the interests of all citizens and generates information that can be reused for the benefit of society. Here are some examples of citizen science projects that help collect data whose reuse can have a positive impact on society:

AtmOOs Academic Project: Education and citizen science on air pollution and mobility.

In this programme, Thigis developed a citizen science pilot on mobility and the environment with pupils from a school in Barcelona's Eixample district. This project, which can already be replicated in other schools, consists of collecting data on student mobility patterns in order to analyse issues related to sustainability.

On the AtmOOs Academic website you can view the results of all the editions, which have been carried out annually since the 2017-2018 academic year; they show information on the vehicles students use to get to class and the emissions generated at each school stage.

WildINTEL: Research project on life monitoring in Huelva

The University of Huelva and the Spanish National Research Council (CSIC) are collaborating to build a wildlife monitoring system to obtain essential biodiversity variables. To do this, the project combines photo-trapping cameras for remote data capture with artificial intelligence.

The wildINTEL project focuses on the development of a monitoring system that is scalable and replicable, thus facilitating the efficient collection and management of biodiversity data. This system will incorporate innovative technologies to provide accurate and objective demographic estimates of populations and communities.

The project, which started in December 2023 and will run until December 2026, is expected to provide tools and products to improve the management of biodiversity, not only in the province of Huelva but throughout Europe.

IncluScience-Me: Citizen science in the classroom to promote scientific culture and biodiversity conservation.

This citizen science project combining education and biodiversity arises from the need to bring scientific research into schools. To do this, students take on the role of researchers to tackle a real challenge: tracking and identifying the mammals that live in their immediate environment, helping to update a distribution map and, in doing so, supporting their conservation.

IncluScience-Me was born at the University of Cordoba and, specifically, in the Research Group on Education and Biodiversity Management (Gesbio), and has been made possible thanks to the participation of the University of Castilla-La Mancha and the Research Institute for Hunting Resources of Ciudad Real (IREC), with the collaboration of the Spanish Foundation for Science and Technology - Ministry of Science, Innovation and Universities.

The Memory of the Herd: Documentary corpus of pastoral life.

This citizen science project, which has been active since July 2023, aims to gather knowledge and experiences from shepherds, both active and retired, about herd management and livestock farming.

The entity responsible for the programme is the Institut Català de Paleoecologia Humana i Evolució Social, although the Museu Etnogràfic de Ripoll, Institució Milà i Fontanals-CSIC, Universitat Autònoma de Barcelona and Universitat Rovira i Virgili also collaborate.

The programme helps to interpret the archaeological record and contributes to preserving knowledge of pastoral practice. It also recognises the value of the experience and knowledge of older people, helping to counter the negative connotation of "old age" in a society that gives priority to "youth", so that they are seen not as passive subjects but as active social agents.

Plastic Pirates Spain: Study of plastic pollution in European rivers.

This citizen science project, carried out over the last year with young people between 12 and 18 years of age in the communities of Castilla y León and Catalonia, aims to contribute to generating scientific evidence and environmental awareness about plastic waste in rivers.

To this end, groups of young people from different educational centres, associations and youth groups have taken part in sampling campaigns to collect data on the presence of waste and rubbish, mainly plastics and microplastics, on riverbanks and in the water.

In Spain, this project has been coordinated by the BETA Technology Centre of the University of Vic - Central University of Catalonia together with the University of Burgos and the Oxygen Foundation. You can access more information on their website.

These are just a few examples of citizen science projects. You can find out more at the Observatory of Citizen Science in Spain, an initiative that brings together a wide range of educational resources, reports and other interesting information on citizen science and its impact in Spain. Do you know of any other projects? Send them to us at dinamizacion@datos.gob.es and we will publicise them through our dissemination channels.

Blog

Artificial intelligence (AI) is revolutionising the way we create and consume content. From automating repetitive tasks to personalising experiences, AI offers tools that are changing the landscape of marketing, communication and creativity. 

These AI systems need to be trained on data that is fit for purpose and free of copyright restrictions. Open data is therefore emerging as a very useful resource for the future of AI.

The Govlab has published the report "A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI" to explore this issue in more detail. It analyses the emerging relationship between open data and generative AI, presenting various scenarios and recommendations. Its key points are set out below.

The role of data in generative AI

Data is the fundamental basis for generative artificial intelligence models. Building and training such models requires a large volume of data, the scale and variety of which is conditioned by the objectives and use cases of the model. 

The following graphic explains how data functions as a key input and output of a generative AI system. Data is collected from various sources, including open data portals, in order to train a general-purpose AI model. This model will then be adapted to perform specific functions and different types of analysis, which in turn generate new data that can be used to further train models.

The figure summarises this cycle: (1) data is collected, purchased or downloaded, including from open data portals; (2) models are trained to generalise patterns from the data and apply them to new applications; (3) the result is a general-purpose AI model; (4) the model is adapted for specific uses, which may involve grounding it on specific, relevant datasets; and (5) it is applied to tasks such as question answering, sentiment analysis, information extraction, image captioning and object recognition. New data generated through user feedback and model outputs can then be used to further train and refine the AI model.

Figure 1. The role of open data in generative AI, adapted from the report “A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI”,  The Govlab, 2024.

5 scenarios where open data and artificial intelligence converge

In order to help open data providers "prepare" their data for generative AI, The Govlab has defined five scenarios outlining different ways in which open data and generative AI can intersect. These scenarios are intended as a starting point, to be expanded in the future based on available use cases.

  1. Pre-training: training the foundational layers of a generative AI model with large amounts of open data. Quality requirements: a high volume of diverse data, representative of the application domain and often unstructured. Metadata requirements: clear information on the source of the data. Example: data from NASA's Harmonized Landsat Sentinel-2 (HLS) project was used to train the geospatial foundation model watsonx.ai.
  2. Adaptation: refinement of a pre-trained model with task-specific open data, using fine-tuning or RAG techniques. Quality requirements: tabular and/or unstructured data of high accuracy and relevance to the target task, with a balanced distribution. Metadata requirements: metadata focused on the annotation and provenance of the data to provide contextual enrichment. Example: building on the LLaMA 70B model, the French Government created LLaMandement, a refined large language model for the analysis and drafting of legal project summaries, using data from SIGNALE, the French government's legislative platform.
  3. Inference and insight generation: extracting information and patterns from open data using a trained generative AI model. Quality requirements: high-quality, complete and consistent tabular data. Metadata requirements: descriptive metadata on data collection methods, source information and version control. Example: Wobby is a generative interface that accepts natural language queries and produces answers in the form of summaries and visualisations, using datasets from offices such as Eurostat or the World Bank.
  4. Data augmentation: leveraging open data to generate synthetic data or provide ontologies that extend the amount of training data. Quality requirements: tabular and/or unstructured data that closely represents reality, ensuring compliance with ethical considerations. Metadata requirements: transparency about the generation process and possible biases. Example: a team of researchers adapted the US Synthea model to include demographic and hospital data from Australia, generating approximately 117,000 region-specific synthetic medical records.
  5. Open-ended exploration: exploring and discovering new knowledge and patterns in open data through generative models. Quality requirements: tabular and/or unstructured data, diverse and comprehensive. Metadata requirements: clear information on sources and copyright, understanding of possible biases and limitations, and identification of entities. Example: NEPAccess is a pilot to unlock access to data related to the US National Environmental Policy Act (NEPA) through a generative AI model, which will include functions for drafting environmental impact assessments, data analysis, etc.

Figure 2. Five scenarios where open data and Artificial Intelligence converge, adapted from the report “A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI”, The Govlab, 2024.
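
For instance, the adaptation scenario mentions grounding a model on task-specific open data with RAG techniques. The sketch below is a simplified, illustrative version of the retrieval step only: the example passages are invented and the final call to a language model is left as a placeholder, with TF-IDF similarity standing in for the embedding-based retrieval a real system would use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example passages standing in for documents from an open data portal
documents = [
    "Municipal air quality measurements for 2023, including NO2 and PM10 levels.",
    "Public tender documents for road maintenance contracts in 2024.",
    "Dataset of daily water consumption per district, updated monthly.",
]

question = "What open data is available about air pollution?"

# Retrieval step: rank documents by similarity to the question
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
question_vec = vectorizer.transform([question])
similarities = cosine_similarity(question_vec, doc_matrix).flatten()
best_doc = documents[similarities.argmax()]

# Augmentation step: build a grounded prompt for the language model
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
# response = some_llm_client.generate(prompt)   # placeholder: depends on the model used
print(prompt)
```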

You can read the details of these scenarios in the report, where more examples are explained. In addition, The Govlab has also launched an observatory where it collects examples of intersections between open data and generative artificial intelligence. It includes the examples in the report along with additional examples. Any user can propose new examples via this form. These examples will be used to further study the field and improve the scenarios currently defined.

Among the cases that can be seen on the web, we find a Spanish company: Tendios. This is a software-as-a-service company that has developed a chatbot to assist in the analysis of public tenders and bids in order to facilitate competition. This tool is trained on public documents from government tenders.

Recommendations for data publishers

To extract the full potential of generative AI, improving its efficiency and effectiveness, the report highlights that open data providers need to address a number of challenges, such as improving data governance and management. To this end, it sets out five recommendations:

  1. Improve transparency and documentation, through the use of standards, data dictionaries, vocabularies, metadata templates, etc. This helps to establish documentation practices covering the lineage, quality, ethical considerations and impact of results (a minimal sketch follows this list).
  2. Maintain quality and integrity. Training and routine quality assurance processes are needed, including automated or manual validation, as well as tools to update datasets quickly when necessary. Mechanisms for reporting and addressing data-related issues are also needed, to foster transparency and help build a community around open datasets.
  3. Promote interoperability and standards. This involves adopting and promoting international data standards, with a special focus on synthetic data and AI-generated content.
  4. Improve accessibility and user-friendliness. This involves enhancing open data portals with intelligent search algorithms and interactive tools. It is also essential to establish a shared space where data publishers and users can exchange views and express needs, in order to match supply and demand.
  5. Address ethical considerations. Protecting data subjects is a top priority when talking about open data and generative AI. Comprehensive ethics committees and ethical guidelines are needed around the collection, sharing and use of open data, as well as advanced privacy-preserving technologies.
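
As an illustration of the first recommendation, a publisher could accompany each dataset with a machine-readable documentation record such as the one sketched below. The fields and values are invented, loosely inspired by common catalogue vocabularies such as DCAT, and are not a formal implementation of any standard.

```python
import json

# Hypothetical documentation record accompanying an open dataset
dataset_metadata = {
    "title": "Daily air quality measurements",
    "description": "NO2 and PM10 readings from municipal sensors.",
    "publisher": "Example City Council",
    "license": "CC-BY-4.0",
    "temporal_coverage": "2020-01-01/2024-12-31",
    "update_frequency": "daily",
    "lineage": "Raw sensor data validated and aggregated before publication.",
    "quality_notes": "Roughly 2% of readings missing due to sensor downtime.",
    "ethical_considerations": "No personal data; sensors are on public infrastructure.",
    "version": "1.3.0",
    "contact": "opendata@example.org",
}

print(json.dumps(dataset_metadata, indent=2, ensure_ascii=False))
```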

This is an evolving field that requires constant attention from data publishers, who must provide technically and ethically sound datasets if generative AI systems are to reach their full potential.
