Blog

"I'm going to upload a CSV file for you. I want you to analyze it and summarize the most relevant conclusions you can draw from the data." A few years ago, data analysis was the territory of those who knew how to write code and use complex technical environments, and such a request would have required programming or advanced Excel skills. Today, being able to analyze data files in a short time with AI tools gives us great professional autonomy. Asking questions, testing preliminary ideas and exploring information first-hand changes our relationship with knowledge, especially because we stop depending on intermediaries to obtain answers. Gaining the ability to analyze data with AI independently speeds up processes, but it can also lead us to become overconfident in our conclusions.
Using a raw data file as an example, we will review possibilities, precautions and basic guidelines for exploring the information without jumping to conclusions too quickly.

The file:

To show an example of data analysis with AI we will use a file from the National Institute of Statistics (INE) that collects information on tourist flows in Europe, specifically on occupancy in rural tourism accommodation. The data file contains information from January 2001 to December 2025. It contains disaggregations by sex, age and autonomous community or city, which allows comparative analyses to be carried out over time. At the time of writing, the last update to this dataset was on January 28, 2026.

Screenshot of the National Statistics Institute (INE) page corresponding to a dataset. The title of the dataset appears at the top: “Survey of occupancy in rural tourism accommodation.” Below this is a line with an icon and the option to access via API next to the name of the dataset: “Travelers and overnight stays by autonomous communities and cities.” Above this line, there is a pop-up box (tooltip) with the following text: "Last modified: 01/28/2026 Dimensions: Autonomous Communities and Cities, Travelers and overnight stays, Residence Period: Jan. 2001 - Dec. 2025." To the left of the rows are several circular icons and “API” labels indicating options for accessing the data. In the background, part of the text on the page related to the survey and its dimensions can be seen, although it is partially covered by the pop-up box.

Figure 1. Dataset information. Source: National Institute of Statistics (INE).

1. Initial exploration

For this first exploration we will use a free version of Claude, the AI assistant developed by Anthropic. Its models rank among the most advanced in reasoning and analysis benchmarks, which makes it especially suitable for this exercise, and it is one of the options most widely used by the community for tasks that involve code.

Let's assume we are looking at the data file for the first time. We know broadly what it contains, but we do not know how the information is structured. Our first prompt, therefore, should focus on describing it:

PROMPT: I want to work with a data file on occupancy in rural tourism accommodation. Explain to me what structure the file has: what variables it contains, what each one measures and what possible relationships exist between them. Also point out possible missing values or elements that require clarification.

Screenshot of Claude's initial exploration of the data file. It includes the section “General description,” which indicates that the file contains 36,000 rows (plus the header) with monthly data on travelers and overnight stays in rural tourism accommodations in Spain, presumably from the INE (National Institute of Statistics). It covers the period from January 2001 to December 2025, with a breakdown by territory and by origin of visitors. Below is the “Variables (columns)” section, which explains several columns in the dataset: Territorial totals, Autonomous Communities and Cities, Travelers and overnight stays, Residence: Level 1 and Residence: Level 2, with descriptions of what each variable measures and the possible values (e.g., travelers, overnight stays, residents in Spain, or residents abroad). Details of each section are provided below. Territorial Totals — This column has only one value in the entire file: “National Total.” This column identifies the maximum level of aggregation, but in practice it does not provide any variation. It probably exists because the original source allows for other levels (provinces, municipalities) that are not included here. Autonomous Communities and Cities — Identifies the specific territory: the 17 autonomous communities plus Ceuta and Melilla (19 territories in total). When this column is empty (NaN), the row corresponds to the aggregate for Spain as a whole, i.e., the “National Total” without a territorial breakdown. There are 1,800 rows in this situation.  Travelers and overnight stays — Indicates what magnitude is being measured. It only has two values: Travelers (number of people staying) and Overnight stays (number of nights those travelers spend). Each row contains one of the two metrics, so both coexist in separate rows for the same territories and periods.  Residence: Level 1 — Only contains the value “Total” in all rows. 
This is a higher level of aggregation, indicating that the data is not broken down by type of residence at this level. It does not provide useful variation as it appears in the file. Residence: Level 2 — Breaks down the origin of the traveler into three possibilities: the aggregate (NaN, equivalent to “all”), “Residents in Spain,” and “Residents Abroad.” There are 12,000 rows with NaN, which correspond to the total without distinction of origin.

Figure 2. Initial exploration of the data file with Claude. Source: Claude.

Once Claude has given us the general idea and an explanation of the variables, it is good practice to open the file and do a quick check. The objective is to verify that, at a minimum, the number of rows, the number of columns, the names of the variables, the time period and the data types match what the model has told us.

If we detect any errors at this point, the LLM may not be reading the data correctly. If the error persists after trying in another conversation, it is a sign that something in the file makes it difficult to read automatically. In this case, it is best not to continue with the analysis: the conclusions may look convincing, but they will be based on misinterpreted data.
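To see what this manual check looks like in practice, here is a minimal pandas sketch. The rows below are invented for illustration; the tab-separated layout mirrors the real INE download, which (as Claude's own processing steps note later) uses a tab separator and UTF-8 with BOM, so `encoding="utf-8-sig"` would be needed when reading it from disk.

```python
import io
import pandas as pd

# A few invented rows laid out like the INE file (tab-separated)
raw = (
    "Autonomous Communities and Cities\tTravelers and overnight stays\tPeriod\tTotal\n"
    "07 Castilla y León\tTravelers\t2001M01\t52000\n"
    "09 Cataluña\tTravelers\t2001M01\t48000\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# The quick manual check: row/column counts, variable names, data types
print(df.shape)
print(list(df.columns))
print(df.dtypes)
```

If any of these basics disagree with what the model reported, that is the moment to stop and investigate, not to keep prompting.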

2. Anomaly management

Second, if we have discovered anomalies, it is common to document them and decide how to handle them before proceeding with the analysis. We can ask the model to suggest what to do, but the final decisions will be ours. For example:

  • Missing values: if there are empty cells, we need to decide whether to fill them with an "average" value from the column or simply delete those rows.
  • Duplicates: we have to eliminate repeated rows or rows that do not provide new information.
  • Formatting errors or inconsistencies: we must correct these so that the variables are coherent and comparable. For example, dates represented in different formats.
  • Outliers: if a number appears that does not make sense or is exaggeratedly different from the rest, we have to decide whether to correct it, ignore it or treat it as it is.
     

Example of missing values analysis with Claude. Claude is asked, "Show me an example of the missing values in Total for Ceuta and Melilla," and its answer is: Here you go. The null rows for Ceuta and Melilla have exactly the same structure: the Total column is empty in every period and for every combination of metric and origin, without exception. There are 1,800 null rows per city (300 months × 2 metrics × 3 residence levels = 1,800), and that pattern is perfectly uniform: there is not a single month with a recorded value for either of the two cities. This confirms that these are not occasional missing data, but a total and systematic absence. The INE simply does not record rural tourism in Ceuta and Melilla, and the rows exist in the file because the table structure generates them by construction when crossing all territories with all periods.

Figure 3. Example of missing values analysis with Claude. Source: Claude.

In the case of our file, for example, we have detected that in Ceuta and Melilla the missing values in the Total variable are structural: there is no rural tourism registered in these cities, so we could exclude them from the analysis.

Before making the decision, a good practice at this point is to ask the LLM for the pros and cons of modifying the data. The answer can give us some clue as to which is the best option, or indicate some inconvenience that we had not taken into account.
 

Example of Claude's analysis of whether or not to remove values. Claude is asked, “Does it make sense to remove Ceuta and Melilla from the analysis?” and responds: It depends on the purpose of the analysis, but in most cases it does make sense to remove them, for several reasons. The most obvious is the one we already saw: they do not contribute any useful data. With 3,600 completely empty rows, their presence only adds noise and can complicate aggregate calculations or visualizations without providing any informational benefit. Furthermore, their exclusion is conceptually justified. Ceuta and Melilla are autonomous cities of an urban and port nature, without the type of rural environment that characterizes this type of accommodation. It is not that they lack data: it is that the phenomenon measured by the statistics simply does not exist there in any appreciable form. Including them would distort any comparison between territories, especially if you calculate averages, rankings, or map representations. That said, there are situations in which it might make sense to keep them. If the analysis is purely descriptive of the original file (e.g., a data quality report), removing them would hide relevant information about territorial coverage. And if at some point the INE began to record data for those cities, having them in the structure would facilitate future incorporation.

Figure 4. Claude's analysis of whether or not to remove the values. Source: Claude.

If we decide to go ahead and exclude the cities of Ceuta and Melilla from the analysis, Claude can help us make this modification directly on the file. The prompt would be as follows:

PROMPT: Remove all rows corresponding to Ceuta and Melilla from the file, keeping the rest of the data intact. Also explain the steps you are following so that I can review them.
 

Screenshot of Claude's response to the previous prompt: "It executed two commands and read a file > These are the steps I followed: 1. Loaded the original file without modifying it, keeping the tab separator and the UTF-8 encoding with BOM used by the INE file. 2. Filtered using the inverse condition: all rows whose value in Autonomous Communities and Cities is not '18 Ceuta' or '19 Melilla' are retained. The rest of the rows, including those of the National Total (where that column is empty), remain intact. 3. Verified before saving, checking three things: that the difference in rows is exactly 3,600 (as expected: 1,800 from Ceuta + 1,800 from Melilla), that the remaining 17 territories are correct, and that the null values in Total have disappeared completely, confirming that all the null values in the file came exclusively from those two cities. 4. Saved to a new file, keeping the same format (TSV, UTF-8 with BOM) so that it is compatible with the same tools as the original. The result is a file with 32,400 rows, with no missing values in Total. Travelers and overnight stays without Ceuta Melilla Table · CSV Download"

Figure 5. Step-by-step modification of the data in Claude. Source: Claude.

At this point, Claude offers the modified file for download, so a good checking practice is to manually validate that the operation was carried out correctly: for example, compare the number of rows in one file and the other, or spot-check some rows at random against the first file to make sure the data has not been corrupted.
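The filtering and validation steps can be sketched in pandas. The miniature table below is invented, but the logic (inverse condition, row-count difference, remaining nulls) mirrors the one Claude describes:

```python
import io
import pandas as pd

# Invented miniature of the operation; rows with an empty community
# (the national total) must survive the filter
raw = (
    "Community\tTotal\n"
    "18 Ceuta\t\n"
    "19 Melilla\t\n"
    "07 Castilla y León\t52000\n"
    "\t1000000\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")
filtered = df[~df["Community"].isin(["18 Ceuta", "19 Melilla"])]

# The manual validation: rows dropped, and nulls left after filtering
print(len(df) - len(filtered))         # rows removed
print(filtered["Total"].isna().sum())  # should be 0 if nulls were structural
```

Note that `isin` returns False for the empty-community rows, so the inverse condition keeps the national aggregate intact, exactly as in Claude's description.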

3. First questions and visualizations

If the result so far is satisfactory, we can start exploring the data, asking ourselves initial questions and looking for interesting patterns. When starting the exploration, it is best to ask big, clear questions that are easy to answer with the data, because they give us a first overall picture.

PROMPT: Work with the file without Ceuta and Melilla from now on. Which have been the five communities with the most rural tourism over the whole period?
 

Screenshot of a response generated by Claude. At the top is the indication "Executed 2 commands >". Below, the following text can be read: "Taking as a reference the cumulative total of travelers over the whole period (January 2001 – December 2025) and using the aggregate without distinction of origin (residents in Spain + foreigners), the ranking is as follows:" A table with three columns follows: #, Community, Total travelers. Table rows: 1. Castilla y León: 14,745,928; 2. Cataluña: 8,568,320; 3. Andalucía: 5,682,985; 4. Castilla-La Mancha: 5,256,132; 5. Illes Balears: 4,953,850. Below the table appears the following explanatory paragraph: "Castilla y León stands out very clearly, almost doubling Cataluña, the second. Its leading position makes sense given that it is the community with the largest rural area in Spain and has had a consolidated offer of rural houses for decades. The presence of Illes Balears in the top 5 is striking, since it is usually associated more with sun-and-beach tourism, which suggests that it also has a relevant rural segment."

Figure 6. Claude's response with the five communities with the most rural tourism in the period. Source: Claude.
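Under the hood, this kind of ranking is a simple group-and-sum. A sketch with invented figures (the column names are simplified, not the file's real headers):

```python
import io
import pandas as pd

# Invented mini-dataset; the real ranking sums Total per community over
# all periods, using the aggregate rows (no residence split)
raw = (
    "Community\tMetric\tPeriod\tTotal\n"
    "07 Castilla y León\tTravelers\t2001M01\t100\n"
    "07 Castilla y León\tTravelers\t2001M02\t120\n"
    "09 Cataluña\tTravelers\t2001M01\t90\n"
    "01 Andalucía\tTravelers\t2001M01\t80\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

top = (
    df[df["Metric"] == "Travelers"]          # keep one metric only
    .groupby("Community")["Total"].sum()     # cumulative total per community
    .sort_values(ascending=False)
    .head(5)
)
print(top)
```

Knowing that the question reduces to three pandas operations also makes it easy to sanity-check the model's table against a manual calculation.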

Finally, we can ask Claude to help us visualize the data. Instead of specifying a particular chart type, we give the model the freedom to choose the format that best displays the information.

PROMPT: Can you visualize this information on a graph? Choose the most appropriate format to represent the data.

Screenshot of an interface split into two panels showing Claude's response to the request to visualize the data in a chart. Left panel (conversation/explanation): at the top appears a label, "Rural tourism occupancy data structure." Below appears the user's message: "Can you visualize this information on a graph? Choose the most appropriate format to represent the data." The assistant then responds: "I am rendering a horizontal bar chart, the most appropriate format for comparing magnitudes between nominal categories such as autonomous communities. Horizontal bars allow the full names to be read and make it easy to compare the distances between values visually, something a pie or line chart would not do as well here. Each bar is scaled proportionally to the maximum (Castilla y León) and animates on load so that the comparison is more intuitive." Below appears a downloadable block with a file icon and the text "Top5 rural tourism, Code · HTML"; to its right there is a "Download" button. At the bottom left of the panel a small loading icon (spinner) appears.
Right panel (chart preview): at the top right there is a bar with the file name "Top5 rural tourism · HTML" and, to its right, a "Copy" button. Below, the visualization is shown. Small text at the top: "SURVEY OF OCCUPANCY IN RURAL TOURISM ACCOMMODATION · INE". Large title: "Top 5 communities in rural tourism". Subtitle: "Cumulative total of travelers · 2001–2025". Below appears a horizontal bar chart with five communities and their approximate values: Castilla y León, long dark red bar, 14.7M; Cataluña, green bar, 8.6M; Andalucía, brown bar, 5.7M; Castilla-La Mancha, light green bar, 5.3M; Illes Balears, light orange bar, 4.9M. At the bottom appears a small note: "Aggregate data on travelers in rural tourism accommodation (residents in Spain + foreigners). Source: Survey of Occupancy in Rural Tourism Accommodation, INE." In the lower right corner it reads: "Spain · 2001–2025".

Figure 7. Chart prepared by Claude to represent the information. Source: Claude.

Here the screen splits: on the left we can continue the conversation or download the file, while on the right we can view the chart directly. Claude has generated a very visual, ready-to-use horizontal bar chart. The colors differentiate the communities, and the date range and type of data are correctly indicated.
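For reference, an equivalent chart can be reproduced locally with matplotlib, using the five totals from Claude's answer; the styling here is deliberately minimal, unlike Claude's animated HTML version:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# The five totals from Claude's answer (cumulative travelers, 2001-2025)
communities = ["Castilla y León", "Cataluña", "Andalucía",
               "Castilla-La Mancha", "Illes Balears"]
travelers = [14_745_928, 8_568_320, 5_682_985, 5_256_132, 4_953_850]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(communities[::-1], travelers[::-1])  # reversed so the largest bar sits on top
ax.set_xlabel("Cumulative travelers, 2001-2025")
ax.set_title("Top 5 communities in rural tourism")
fig.tight_layout()
fig.savefig("top5_rural_tourism.png")
```

Having a local, scriptable version of the chart is also a convenient way to double-check that the numbers in the AI-generated visualization match the underlying data.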

What happens if we ask it to change the chart's color palette to an inappropriate one? In this case, for example, we are going to ask for a series of pastel shades that are barely distinguishable from one another.

PROMPT: Can you change the color palette of the chart to this? #E8D1C5, #EDDCD2, #FFF1E6, #F0EFEB, #EEDDD3
 

Screenshot of a split-screen interface showing Claude's response to the request to change the color palette of a chart. Left panel (conversation): at the top is a label, "Rural tourism occupancy data structure." Below is the user's message in a speech bubble: "Can you change the color palette of the graph to this one? #E8D1C5, #EDDCD2, #FFF1E6, #F0EFEB, #EEDDD3". Below is the assistant's response: "I will use the colors from the new palette in soft and light tones, which may make the bars difficult to distinguish from the background. To maintain readability, I will slightly darken the background and adjust the text color of the value labels so that they contrast well. Done. I have applied the five colors you specified to the bars. Since the palette is very light and pale, I have also changed the background to a dark shade so that the bars stand out and the chart is readable, keeping the rest of the design intact." Below is a downloadable block with a file icon and the text "Top 5 rural tourism, Code · HTML". To the right of the block there is a "Download" button. At the bottom left of the panel there are several status indicators and a small loading icon (spinner). Right panel (chart preview): at the top there is a bar with the file name "Top 5 rural tourism · HTML". On the right are action buttons, including "Copy". Below is the visualization with a dark background. Small text at the top of the chart: "SURVEY OF OCCUPANCY IN RURAL TOURISM ACCOMMODATION · INE". Main title: "Top 5 communities in rural tourism". Subtitle: "Cumulative total of travelers · 2001–2025". Below is a horizontal bar chart with five communities and values to the right of each bar. The bars use light shades from the requested palette.
Graph elements:  Castile and León — long light-colored bar — 14.7 million  Catalonia — light bar — 8.6 million  Andalusia — light bar — 5.7 million  Castile-La Mancha — light bar — 5.3 million  Balearic Islands — light bar — 5.0M  At the bottom, there is a small explanatory text:  “Aggregate data on travelers in rural tourism accommodations (residents in Spain + foreigners). Source: Survey of Occupancy in Rural Tourism Accommodations, INE.”  In the lower right corner of the graph, it reads:  “Spain · 2001–2025.”

Figure 8. Adjustments made by Claude to the chart to represent the information. Source: Claude.

Faced with the challenge, Claude intelligently adjusts the chart on its own, darkening the background and changing the label text to maintain readability and contrast.
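The readability problem can even be quantified. Using the WCAG 2 relative-luminance formula, the contrast ratio between one of the requested pastels and a white background comes out barely above 1, the value for two identical colors. A short sketch:

```python
# Contrast check using the WCAG 2 relative-luminance formula; the hex
# values are pastels from the prompt plus a white reference background.
def luminance(hex_color: str) -> float:
    def channel(v: int) -> float:
        c = v / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast(c1: str, c2: str) -> float:
    hi, lo = sorted((luminance(c1), luminance(c2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

# Identical colors give exactly 1.0; a pastel on white is barely above it
print(contrast("#FFF1E6", "#FFFFFF"))
```

WCAG's non-text contrast criterion asks for at least 3:1 between graphical elements and adjacent colors, which is exactly why darkening the background, rather than keeping a light one, was the sensible fix.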

 

The whole exercise above has been done with Claude Sonnet 4.6, which is not Anthropic's highest-quality model. Higher versions, such as Claude Opus 4.6, offer greater reasoning capacity, deeper understanding and finer results. In addition, there are many other AI-based tools for working with data and visualizations, such as Julius or Quadratic. Although the possibilities they offer are almost endless, when we work with data it is still essential to maintain our own methodology and criteria.

Contextualizing the data we are analyzing in real life and connecting it with other knowledge is not a task that can be delegated; we need at least a rough prior idea of what we want to achieve with the analysis in order to convey it to the system. This will allow us to ask better questions, interpret the results properly and, therefore, prompt more effectively.

Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.

Blog

We live in an era where science is increasingly reliant on data. From urban planning to the climate transition, data governance has become a structural pillar of evidence-based decision-making. However, there is one area where the traditional principles of data management, validation and control are subjected to extreme tensions: the universe.

Space data—produced by scientific satellites, telescopes, interplanetary probes, and exploration missions— do not describe accessible or repeatable realities. They observe phenomena that occurred millions of years ago, at distances impossible to travel and under conditions that can never be replicated in the laboratory. There is no "in situ" measurement that directly confirms these phenomena.

In this context, data governance ceases to be an organizational issue and becomes a structural element of scientific trust. Quality, traceability and reproducibility cannot be supported by direct physical references, but by methodological transparency, comprehensive documentation and the robustness of instrumental and theoretical frameworks.

Governing data in the universe therefore involves facing unique challenges: managing structural uncertainty, documenting extreme scales, and ensuring trust in information we can never touch.

Below, we explore the main challenges posed by data governance when the object of study is beyond Earth.

I. Specific challenges of data from the universe

1. Beyond Earth: new sources, new rules

When we talk about space data, we mean much more than satellite images of the Earth's surface. We delve into a complex ecosystem that includes space and ground-based telescopes, interplanetary probes, planetary exploration missions, and observatories designed to detect radiation, particles, or extreme physical phenomena.

These systems generate data with clearly different challenges compared to other scientific domains:

The main challenges, and their impact on data governance, are:

  • Non-existent physical access: there is no direct validation; trust lies in the integrity of the channel.
  • Instrumental dependence: the data is a direct "child" of the sensor's design. If the sensor fails or is out of calibration, reality is distorted.
  • Uniqueness: many astronomical events are one-off. There is no "second chance" to capture them.
  • Extreme cost: the value of each byte is very high due to the investment required to put the sensor into orbit.

Figure 1. Challenges in data governance across the universe. Source: own elaboration - datos.gob.es.

Unlike Earth observation data, which in many cases can be cross-checked through field campaigns or redundant sensors, data from the universe depend fundamentally on the mission architecture, instrument calibration, and the physical models used to interpret the captured signal.

In many cases, what is recorded is not the phenomenon itself, but an indirect signal: spectral variations, electromagnetic emissions, gravitational alterations or particles detected after traveling millions of kilometers. The data is, in essence, an instrumental translation of an inaccessible phenomenon. 

For all these reasons, space data cannot be understood without the technical context that generates it.

2. Structural uncertainty and extreme scales

Uncertainty refers to the margin of error or indeterminacy associated with a scientific measurement, interpretation, or result, due to the limits of the instruments, observing conditions, and models used to analyze the data. If in other fields uncertainty is something we try to reduce through direct, repeatable and verifiable measurements, in the observation of the universe uncertainty is part of the knowledge process itself. It is not simply a matter of "not knowing enough", but of facing physical and methodological limits that cannot be completely eliminated.

Therefore, in the observation of the universe, uncertainty is structural. It is not a specific anomaly, but a condition inherent to the object of study.

There are several critical dimensions:

  • Extreme spatial and temporal scales: cosmic distances prevent any direct validation. Timescales imply that the data often captures an "instant" of the remote past and not a verifiable present reality.
  • Weak signals and unavoidable noise: the instruments capture extremely subtle emissions. The useful signal coexists with interference, technological limitations and background noise. Interpretation depends on advanced statistical treatments and complex physical models.
  • Limited-observation phenomena: Some astrophysical phenomena—such as certain supernovae, gamma-ray bursts, or singular gravitational configurations—cannot be experimentally recreated and can only be observed when they occur. In these cases, the available record may be unique or profoundly limited, increasing the responsibility for documentation and preservation.

Not all phenomena are unrepeatable, but in many cases the opportunities for observation are scarce or depend on exceptional conditions.

II. Building trust when we can't touch the object observed

In the face of these challenges, data governance takes on a structural role. It is not limited to guaranteeing storage or availability, but defines the rules by which scientific processes are documented, traceable and auditable.

In this context, governing does not mean producing knowledge, but rather ensuring that its production is transparent, verifiable and reusable.

1. Quality without direct physical validation

When the observed phenomenon cannot be directly verified, the quality of the data is based on:

  • Rigorous calibration protocols: instruments must undergo systematic calibration processes before, during, and after operation. This involves adjusting their measurements against known baselines, characterizing their margins of error, documenting deviations, and recording any modifications to their configuration. Calibration is not a one-off event, but an ongoing process that ensures that the recorded signal reflects, as accurately as possible, the observed phenomenon within the physical limits of the system.
  • Cross-validation between independent instruments: when different instruments – either on the same mission or on different missions – observe a similar phenomenon, the comparison of results allows the reliability of the data to be reinforced. The convergence between observations obtained with different technologies reduces the probability of instrumental bias or systematic errors. This inter-instrumental coherence acts as an indirect verification mechanism.
  • Observational repetition when possible: although not all phenomena can be repeated, many observations can be made at different times or under different conditions. Repetition makes it possible to evaluate the stability of the signal, identify anomalies and distinguish natural variability from measurement error. Consistency over time strengthens the robustness of the result.
  • Peer review and progressive scientific consensus: the data and their interpretations are subject to evaluation by the scientific community. This process involves methodological scrutiny, critical analysis of assumptions, and verification of consistency with existing knowledge. Consensus does not emerge immediately, but through the accumulation of evidence and scientific debate. Quality, in this sense, is also a collective construction.
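As a toy numerical illustration of the cross-validation idea (all the measurements and the tolerance below are invented, not from any real mission):

```python
# Toy cross-validation between two independent instruments measuring the
# same quantity; values and tolerance are invented for illustration.
instrument_a = [10.2, 10.4, 10.1]  # repeated measurements, instrument A
instrument_b = [10.3, 10.5, 10.2]  # same phenomenon, instrument B

mean_a = sum(instrument_a) / len(instrument_a)
mean_b = sum(instrument_b) / len(instrument_b)
tolerance = 0.5  # assumed combined measurement uncertainty

# The means agree within the combined uncertainty, so the observations
# are mutually consistent: an indirect verification of both instruments
consistent = abs(mean_a - mean_b) <= tolerance
print(f"A={mean_a:.2f}, B={mean_b:.2f}, consistent={consistent}")
```

When such independent measurements converge within their stated uncertainties, the probability of an instrumental bias explaining the result drops sharply, which is the sense in which inter-instrumental coherence acts as indirect verification.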

Quality is not just a technical property; it is the result of a documented and auditable process.

2. Complete scientific traceability

In the spatial context, data is inseparable from the technical and scientific process that generates it. It cannot be understood as an isolated result, but as the culmination of a chain of instrumental, methodological and analytical decisions.

Therefore, traceability must explicitly document:

  • Instrument design and configuration: information about the technical characteristics of the instrument that captured the signal, such as its architecture, sensing capabilities, resolution limits, and operational configurations, needs to be retained. These conditions determine what type of signal can be recorded and how accurately.
  • Calibration parameters: The adjustments applied to ensure that the instrument operates within the intended margins must be recorded, as well as the modifications made over time. The calibration parameters directly influence the interpretation of the obtained signal.
  • Processing software versions: the processing of raw data depends on specific IT tools. Preserving the versions used allows you to understand how the results were generated and avoid ambiguities if the software evolves.
  • Algorithms applied in noise reduction: since signals are often accompanied by interference or background noise, it is essential to document the methods used to filter, clean, or transform the information before analysis. These algorithms influence the final result.
  • Scientific assumptions used in the interpretation: the reading of the data is not neutral: it is based on theoretical frameworks and physical models accepted at the time of analysis. Recording these assumptions allows you to contextualize the conclusions and understand possible future revisions.
  • Successive transformations from the raw data to the published data: from the original signal to the final scientific product, the data goes through different phases of processing, aggregation and analysis. Each transformation must be able to be reconstructed to understand how the communicated result was reached.
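Purely as an illustration, the traceability dimensions above could be captured in a structured provenance record attached to each dataset; every field name below is invented for the example, not taken from any agency's metadata standard:

```python
# Illustrative provenance record; all field names and values are
# invented, not any agency's real metadata schema.
provenance = {
    "instrument": {"name": "spectrometer-A", "resolution_nm": 0.1},
    "calibration": {"last_run": "2024-03-01", "offset_correction": 0.002},
    "software": {"pipeline": "reduce.py", "version": "2.4.1"},
    "noise_reduction": ["dark-frame subtraction", "median filter"],
    "assumptions": ["physical model X, as accepted at analysis time"],
    "transformations": ["raw -> calibrated -> stacked -> published"],
}

# A record is complete only if every traceability dimension is present
required = {"instrument", "calibration", "software",
            "noise_reduction", "assumptions", "transformations"}
missing = required - provenance.keys()
print("complete" if not missing else f"missing: {sorted(missing)}")
```

Such a record, versioned alongside the data, is what allows a future researcher to reconstruct each transformation from raw signal to published product.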

Without exhaustive traceability, reproducibility is weakened and future interpretability is compromised. When it is not possible to reconstruct the entire process that led to a result, its independent evaluation becomes limited and its scientific reuse loses its robustness.

3. Long-term reproducibility

Space missions can span decades, and their data can remain relevant long after the mission has ended. In addition, scientific interpretation evolves over time: new models, new tools, and new questions may require reanalyzing information generated years ago.

Therefore, data must remain interpretable even when the original equipment no longer exists, technological systems have changed, or the scientific context has evolved.

This requires:

  • Rich and structured metadata: the contextual information that accompanies the data – about its origin, acquisition conditions, processing and limitations – must be organized in a clear and standardized way. Without sufficient metadata, the data loses meaning and becomes difficult to reinterpret in the future.
  • Persistent identifiers: Each dataset must be able to be located and cited in a stable manner over time. Persistent identifiers allow the reference to be maintained even if storage systems or technology infrastructures change.
  • Robust digital preservation policies: long-term preservation requires strategies that take into account format obsolescence, technological migration, and archive integrity. It is not enough to store; it is necessary to ensure that the data remains accessible and readable over time.
  • Accessible documentation of processing pipelines: the process that transforms raw data into scientific product must be described in a comprehensible way. This allows future researchers to reconstruct the analysis, verify the results, or apply new methods on the same original data.
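
As a simple illustration of these four requirements, the sketch below shows what a minimal machine-readable metadata record for a processed data product might look like. Every field name, identifier and value here is hypothetical, not taken from any real archive standard; the point is only how descriptive metadata, a persistent identifier, a fixity checksum and a pipeline reference fit together:

```python
import hashlib
import json

# Hypothetical raw measurement file produced by an instrument.
raw_bytes = b"signal,timestamp\n0.42,2031-05-01T00:00:00Z\n"

# A minimal metadata record; all field names are illustrative.
record = {
    "title": "Example processed data product",
    "persistent_identifier": "doi:10.0000/example-dataset",  # stable citation
    "provenance": {
        "instrument": "example-spectrometer",
        "calibration_version": "cal-2.3",
        "pipeline": "pipeline.py@v1.4",  # lets the analysis be reconstructed
    },
    # Fixity information supports long-term preservation checks.
    "checksum_sha256": hashlib.sha256(raw_bytes).hexdigest(),
}

# Serialize in an open, widely readable format.
serialized = json.dumps(record, indent=2)
```

A record like this can be validated and re-read decades later with any JSON parser, which is exactly the kind of independence from specific tooling that long-term reproducibility requires.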

Reproducibility, in this context, does not mean physically repeating the observed phenomenon, but being able to reconstruct the analytical process that led to a given result. Governance doesn't just manage the present; it ensures the future reuse of knowledge and preserves the ability to reinterpret information in the light of new scientific advances.

Rules for capturing documented, traceable, and auditable spatial data: 1. Quality without direct physical validation (1.1 rigorous calibration protocols; 1.2 cross-validation between independent instruments; 1.3 observational repetition when possible; 1.4 peer review and progressive scientific consensus). 2. Complete scientific traceability (2.1 instrument design and configuration; 2.2 calibration parameters; 2.3 processing software versions; 2.4 algorithms applied in noise reduction; 2.5 scientific assumptions used in interpretation; 2.6 successive transformations from raw data to published data). 3. Long-term reproducibility (3.1 rich and structured metadata; 3.2 persistent identifiers; 3.3 robust digital preservation policies; 3.4 accessible documentation of processing pipelines).

Figure 2. Rules for capturing documented, traceable, and auditable spatial data. Source: own elaboration - datos.gob.es.

Conclusion: Governing What We Can't Touch

The data of the universe forces us to rethink how we understand and manage information. We are working with realities that we cannot visit, touch or verify directly. We observe phenomena that occur at immense distances and in times that exceed the human scale, through highly specialized instruments that translate complex signals into interpretable data.

In this context, uncertainty is not a mistake or a weakness, but a natural feature of the study of the cosmos. The interpretation of data depends on scientific models that evolve over time, and quality is not based on direct verification, but on rigorous processes, well documented and reviewed by the scientific community. Trust, therefore, does not arise from direct experience, but from the transparency, traceability and clarity with which the methods used are explained.

Governing spatial data does not only mean storing it or making it available to the public. It means keeping all the information that allows us to understand how they were obtained, how they were processed and under what assumptions they were interpreted. Only then can they be evaluated, reinterpreted and reused in the future.

Beyond Earth, data governance is not a technical detail or an administrative task. It is the foundation that sustains the credibility of human knowledge about the universe and the basis that allows new generations to continue exploring what we cannot yet achieve physically.

Content prepared by Mayte Toscano, Senior Consultant in technologies related to the data economy. The contents and viewpoints expressed in this publication are the sole responsibility of the author.

Blog

Data visualization is not a recent discipline. For centuries, people have used graphs, maps, and diagrams to represent complex information. Classic examples such as the statistical maps of the nineteenth century or the graphs used in the press show that the need to "see" data in order to understand it has always existed.

For a long time, creating visualizations required specialized knowledge and access to professional tools, which limited their production to very specific profiles. However, the digital and technological revolution has profoundly transformed this landscape. Today, anyone with access to a computer and data can create visualizations. Tools have been democratized, many of them are free or open source, and visualization work has extended beyond design to integrate into areas such as statistics, data science, academic research, public administration, or education.

Today, data visualization is a transversal competence that allows citizens to explore public information, institutions to better communicate their policies, and reusers to generate new services and knowledge from open data. In this post we present some of the most accessible and used options in data visualization.

A broad and diverse ecosystem of tools

The ecosystem of data visualization tools is broad and diverse, both in functionalities and levels of complexity. There are options designed for a first exploration of the data, others aimed at in-depth analysis and some designed to create interactive visualizations or complex digital narratives.

This variety allows you to tailor the visualization to different contexts and goals—from understanding a dataset in advance to publishing interactive charts, dashboards, or maps on the web.

The Data Visualization Society's annual survey reflects this diversity and shows how the use of certain tools evolves over time, consolidating some widely known options and giving way to new solutions that respond to emerging needs. These are some of the tools mentioned in the survey, ordered according to usage profiles.

The following criteria have been taken into account for the preparation of this list:

  • Degree of use and maturity of the tool.
  • Free access, or availability of free or open versions.
  • Usefulness for projects related to public data.
  • Priority given to open tools or those with free versions.

Simple tools to get started

These tools are characterized by visual interfaces, a low learning curve, and the ability to create basic charts quickly. They are especially useful for getting started exploring open datasets or for outreach activities.

  • Excel: one of the most widespread and well-known tools. It makes it simple to produce basic charts and carry out an initial exploration of the data. While not specifically designed for advanced visualization, it is still a common gateway to working with data and its graphical representation.
  • Google Sheets: works as a free and collaborative alternative to Excel. Its main advantage is the ability to work in a shared way and publish simple graphics online, which facilitates the dissemination of basic visualizations.
  • Datawrapper: widely used in public communication and data journalism. It allows you to create clear graphs, maps, and interactive tables without the need for technical knowledge. It is particularly suitable for explaining data in a way that is understandable to a wide audience.
  • RAWGraphs: free software tool aimed at visual exploration. It allows you to experiment with less common types of charts and discover new ways to represent data. It is especially useful in exploratory phases.
  • Canva: While its approach is more informative than analytical, it can be useful for creating simple visual pieces that integrate basic graphics with design elements. It is suitable for visual communication of results, not so much for data analysis.

Data exploration and analysis tools

This group of tools is geared towards profiles that want to go beyond basic charts and perform more structured analysis. Many of them are open and widely consolidated in the field of data analysis.

  • R: free programming language widely used in statistics and data analysis. It has a wide ecosystem of packages that allow you to work with public data in a reproducible and transparent way.
  • ggplot2: visualization library for the R language. It is one of the most powerful tools for creating rigorous and well-structured graphs, both for analysis and for communicating results.
  • Python (Matplotlib and Plotly): Python is one of the most widely used languages in data analysis. Matplotlib allows you to create customizable static charts, while Plotly makes it easy to create interactive visualizations. Together they offer a good balance between power and flexibility.
  • Apache Superset: Open source platform for data analysis and dashboard creation. It has a more institutional and scalable approach, making it suitable for organizations that work with large volumes of public data.
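
As a small taste of the Python route mentioned above, the following sketch draws a static bar chart with Matplotlib. The figures are invented for illustration, and the Agg backend is used so the script runs without a display:

```python
import os

import matplotlib
matplotlib.use("Agg")  # render to a file, no screen needed
import matplotlib.pyplot as plt

# Invented example values, not real statistics.
regions = ["Region A", "Region B", "Region C"]
travellers = [1200, 950, 1430]

fig, ax = plt.subplots()
ax.bar(regions, travellers)
ax.set_ylabel("Travellers")
ax.set_title("Example: travellers by region")
fig.savefig("example_chart.png", dpi=150)

saved = os.path.exists("example_chart.png")
```

Plotly follows a similar pattern but produces interactive HTML output instead of a static image, which is why the two libraries are often used side by side.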

This block is especially relevant for open data reusers and intermediate technical profiles who seek to combine analysis and visualization in a systematic way.

Tools for interactive and web visualization

These tools allow you to create advanced visualizations for publication in web environments. Although they require greater technical knowledge, they offer great flexibility and expressive possibilities.

  • D3.js: it is one of the benchmarks in web visualization. It is based on open standards and allows full control over the visual representation of data. Its flexibility is very high, although so is its complexity.

In this practical exercise you can see how to use this library.

  • Vega and Vega-Lite: declarative languages for visualization that simplify the use of D3. They allow you to define graphics in a structured and reproducible way, offering a good balance between power and simplicity.
  • Observable: interactive environment closely linked to D3 and Vega. It's especially useful for creating educational examples, prototypes, and exploratory visualizations that combine code, text, and graphics.
  • Three.js and WebGL: technologies aimed at advanced and three-dimensional visualizations. Their use is more experimental and is usually linked to dissemination projects or visual research.

Although the technical barriers in this section are greater, these tools allow for the creation of rich interactive experiences that can be very effective in communicating complex public data.

Geospatial data and mapping tools

Geographic visualization is especially relevant in the field of open data, since a large part of public information has a territorial dimension. In this field, free software plays a prominent role and is closely aligned with use in public administrations.

  • QGIS: a benchmark in free software for geographic information systems (GIS). It is widely used in public administrations and allows spatial data to be analysed and visualised in great detail.
  • ArcGIS: very widespread in the institutional field. Although it is not free software, its use is well established and is part of the regular ecosystem of many public organizations.
  • Mapbox: platform aimed at creating interactive web maps. It is widely used in online visualization projects and allows geographic data to be integrated into web applications.
  • Leaflet: A popular open-source library for creating interactive maps on the web. It is lightweight, flexible, and widely used in geographic open data reuse projects.

This toolkit facilitates the territorial representation of data and its reuse in local, regional or national contexts.

In conclusion, the choice of a visualization tool depends largely on the goal being pursued. Learning and experimenting is not the same as analyzing data in depth or communicating results to a wide audience. Therefore, it is useful to reflect beforehand on the type of data available, the audience to which the visualization is aimed and the message you want to convey.

Opting for accessible and open tools allows more people to explore, interpret and communicate public data. In this sense, visualising data is also a way of bringing information closer to citizens and encouraging its reuse.

Blog

In recent years, it has become clear that the international scientific community needs agile mechanisms to share research results in order to respond to challenges such as pandemics, the climate crisis, biodiversity loss or the energy transition. In this sense, R&D work has become intensive in the use of both data and specialized software. A concrete example occurred during the COVID-19 pandemic, when data sharing enabled rapid sequencing of the SARS-CoV-2 genome, which was essential for the development of the COVID-19 vaccine in record time.

It is, therefore, time to promote open science. But for open science to become a reality, it is essential to avoid the fragmentation of R&D resources. Beyond scientific publications, it is necessary to connect distributed data repositories and promote interoperable software tools that facilitate the effective reuse of scientific datasets.

In this context, EOSC (European Open Science Cloud) was born, a European initiative that aims to connect the scientific community to make open science a reality and maximize its impact on society. EOSC offers researchers in Europe a multidisciplinary, open and trusted environment where they can publish, discover and reuse data, as well as software tools and services in the scientific field.

What is EOSC? Federated access to scientific resources

The European Open Science Cloud is the European initiative to create an open and trusted environment where the research community can publish, discover and reuse scientific data, as well as research software services. Its focus is to federate and scale scientific resources in Europe, promoting interoperability between disciplines. EOSC's ambition is to accelerate open science practices, increasing scientific productivity and strengthening the reproducibility of research in such a way as to maximise its impact on society. To this end, EOSC is conceived as a "system of systems": instead of centralizing all data and services on a single platform, EOSC interconnects existing platforms (i.e. it performs a federation rather than an integration) such as data repositories, research infrastructures, or scientific software service providers.

The European Commission positions EOSC as the common European space for R&D data and aligns it with the European objective of achieving a data-driven economy and society. In terms of impact, this favors the following aspects:

  • Collaborative research, not only within the same scientific discipline but also between different disciplines and different territories.
  • Reuse and combination of digital scientific resources (such as datasets or software services), as well as the promotion of citizen science.
  • Impact on society through evidence-based policies, by improving the traceability, availability and interoperability of data that underpin public decisions.

To make EOSC a reality, a federated model is being built based on nodes that act as coordinated entry points. On top of these, common policies and shared capabilities (e.g. federated authentication, catalogues and interoperability guides) are established to allow the reuse of data and services. This approach is embodied in the EOSC Federation, which connects infrastructures and communities to provide more homogeneous access to, and reuse of, scientific resources.

What is the EOSC Federation?

According to the EOSC Federation Handbook (a reference document describing its operational structure, legal and governance framework, and technical operation), the EOSC Federation is a distributed network of nodes. These nodes are interconnected and able to collaborate to share and manage scientific knowledge and resources (such as datasets, software and services) across thematic and geographical communities, in compliance with FAIR principles. In other words, it is a distributed network that provides the capabilities to develop interoperable, secure and reliable open science at European level, across disciplines and borders.

As we have seen, the basic elements of this federation are the EOSC Nodes, which function as entry points for the scientific community to the federation. These are platforms operated by organizations or consortia of territorial or thematic scope, which comprise:

  • A set of capabilities essential for operation, such as authentication and access services or a resource catalogue.
  • A set of resources, such as research data products.

A portion of those resources is selected as the Node Exchange, representing what the node shares with the federation. By aggregating the contributions of multiple nodes, these form the EOSC Exchange, i.e. the global supply of resources in the federation.
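
The aggregation mechanism described above can be pictured with a small Python sketch (node names and resources are invented): each node selects a subset of its own resources as its Node Exchange, and the union of all of these forms the EOSC Exchange.

```python
# Each node holds resources; only the ones it marks as shared
# enter its Node Exchange. All names below are illustrative.
nodes = {
    "national-node": {
        "resources": {"dataset-a", "dataset-b", "internal-logs"},
        "shared": {"dataset-a", "dataset-b"},
    },
    "thematic-node": {
        "resources": {"genome-catalogue", "draft-results"},
        "shared": {"genome-catalogue"},
    },
}

# A node's exchange is the shared subset of its own resources.
node_exchanges = {
    name: node["shared"] & node["resources"] for name, node in nodes.items()
}

# The EOSC Exchange is the union of all node exchanges.
eosc_exchange = set().union(*node_exchanges.values())
```

The point of the model is visible even at this toy scale: resources a node keeps to itself never enter the federation-wide offer, while everything it shares becomes discoverable alongside the contributions of every other node.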

For all this to work, Federating Capabilities are defined as common capabilities (technical and also organizational, such as user support) that allow services to work across nodes rather than as isolated silos. These capabilities are enabled by federating services operated by one or more nodes and are supported by interfaces and interoperability guides included in the EOSC Interoperability Framework. The following image represents this graphically:

Conceptual diagram titled “Conceptual Representation of the EOSC Federation.” The image shows a network structure organized into several interconnected levels of nodes.  At the top appears the “EOSC EU Node.” Below it are different types of nodes: “European Node (e.g., e-Infra),” “National Node,” “Regional Node,” and several “Thematic Nodes (e.g., Research Infrastructure; Scientific Cluster).” These nodes are interconnected, representing federation across different geographic and thematic levels.  At the bottom, three functional blocks are shown: “Federating Capabilities,” “Services Participating in Federating Capabilities,” and “Services to Connect to a Federating Capability.” These elements are linked to the upper nodes through “Federating Interfaces defined in the EOSC Interoperability Framework,” which act as a common connection layer.  Overall, the diagram illustrates how different nodes (European, national, regional, and thematic) and their services are integrated through shared capabilities and interfaces within the EOSC Federation framework.  Source: EOSC Federation Handbook.

Figure 1. Conceptual representation of the EOSC Federation (source: EOSC Federation Handbook).

There are two mandatory federated capabilities: on the one hand, the authentication and authorization infrastructure (AAI) and, on the other, the resource catalogs that allow the scientific community to discover and access resources offered by the nodes, not only manually but also through computer services. These first capabilities are articulated in the EOSC EU Node.

EOSC EU Node: the first operational node

In this federated model, the EOSC EU Node (promoted by the European Commission) is especially relevant as the first node of the EOSC Federation, providing an initial set of data, tools and services, and acting as  a reference node to facilitate the interconnection of other nodes.

This node allows researchers to use their institutional credentials to access capabilities such as virtual machines, resources such as GPUs, interactive notebooks, containerized scientific workflows, storage, data transfer, and collaborative tools, as well as to connect to a resource catalogue to discover research results (scientific datasets, publications, or specialized software services) from federated infrastructures.

Conclusions

EOSC turns dispersed scientific resources into an interoperable and reusable ecosystem in which the scientific community can pursue the objectives of open science. The EOSC Federation, through connected nodes and federated capabilities (such as AAI, catalogues or interoperability guides), facilitates access to FAIR data, services and software tools, accelerating scientific collaboration and reproducibility, as well as enabling citizen science initiatives and amplifying the impact of scientific results on society. Finally, it should be noted that EOSC does not replace what already exists, but connects it, makes it interoperable and projects it on a European scale. In Spain, the definition of a national node to connect existing capacities with the EOSC Federation is advancing. The early participation of repositories, infrastructures, research centers, universities and service providers will therefore be key to building a representative offer, defining priorities and maximizing scientific and social impact.

Jose Norberto Mazón, Professor of Languages and Computer Systems at the University of Alicante. The contents and views expressed in this publication are the sole responsibility of the author.

Blog

Data visualizations act as bridges between complex information and human understanding. A well-designed graph can communicate in seconds data that would take minutes or even hours to decipher in tabular format. What's more, interactive visualizations allow each user to explore data from their own perspective, filtering, comparing, and uncovering personalized insights.

To achieve these ends there are multiple tools, some of which we have addressed on previous occasions. Today we look at a new example: the free library D3.js. In this post, we explain how, together with the open source tool Observable, it allows you to generate useful and attractive data visualizations.

What is D3?

D3.js (Data-Driven Documents) is a JavaScript library that allows you to create custom data visualizations in web browsers. Unlike tools that offer predefined charts, D3.js provides the fundamental elements to build virtually any type of visualization imaginable.

The library is completely free and open source, published under a BSD license, which means that any person or organization can use, modify, and distribute it without restrictions. This feature has contributed to its widespread adoption: international media such as The New York Times, The Guardian, Financial Times, and local media such as El País or ABC use D3.js to create journalistic visualizations that help tell stories with data.

D3.js works by manipulating the browser's DOM (Document Object Model). In practical terms, this means that it takes information (e.g., a CSV file with population data) and transforms it into visual elements (circles, bars, lines) that the browser can display. The power of D3.js lies in its flexibility: it doesn't impose a specific way to visualize data, but rather provides the tools to create exactly what is needed.

What is Observable?

Observable is a web-based platform for creating and sharing code, specially designed for working with data and visualizations. Although it offers a freemium service with some free and some paid features, it maintains an open-source philosophy that is particularly relevant for working with public data.

The distinguishing feature of Observable is its "notebook" format. Similar to tools like Jupyter Notebooks in Python, an Observable notebook combines code, visualizations, and explanatory text into a single interactive document. Each cell in the notebook can contain JavaScript code that runs immediately, displaying results instantly. This creates an ideal experimentation environment for exploring data.

You can see it in practice in this data science exercise that we have published on datos.gob.es.

Observable integrates naturally with D3.js and other visualization libraries. In fact, the creator of D3.js is also one of the founders of Observable, so the two tools work together fluidly. Observable notebooks can be shared publicly, allowing other users to view both the code and the results, fork them to create their own versions, or integrate them into their own projects.

Advantages of the tool to work with all types of data

Both D3.js and Observable have features that can be useful for working with data, including open data:

  • Transparency and reproducibility: by publishing a visualization created with these tools, it is possible to share both the final result and the entire data transformation process. Anyone can inspect the code, verify the calculations, and reproduce the results. This transparency is essential when working with public information, where trust and verifiability are essential.
  • No licensing costs: Both D3.js and the free version of Observable allow you to create and publish visualizations without the need to purchase software licenses. This removes economic barriers for organizations, journalists, researchers, or citizens who want to work with open data.
  • Standard web formats: The created visualizations work directly in web browsers without the need for plugins or additional software. This makes it easy to integrate them into institutional websites, newspaper articles or digital reports, making them accessible from any device.
  • Community and resources: There is a large community of users who share examples, tutorials, and solutions to common problems. Observable, in particular, houses thousands of public notebooks that serve as examples and reusable templates.
  • Technical flexibility: Unlike tools with predefined options, these libraries allow you to create completely customized visualizations that are exactly tailored to the specific needs of each dataset or story you want to tell.

It is important to note that these tools require programming knowledge, specifically JavaScript. For people with no programming experience, there is a learning curve that can be steep initially. Other tools such as spreadsheets or visualization software with graphical interfaces may be more appropriate for users looking for quick results without writing code.

For those looking for open source alternatives with a smooth learning curve, there are tools based on visual interfaces that don't require programming. For example, RawGraphs allows you to create complex visualizations by simply dragging and dropping files, while Datawrapper is an excellent and very intuitive option for generating ready-to-publish charts and maps.

In addition, there are numerous open source and commercial alternatives for visualizing data: Python with libraries such as Matplotlib or Plotly, R with ggplot2, Tableau Public, Power BI, among many others. In the didactic section of visualization and data science exercises of datos.gob.es you can find practical examples of how to use some of them.

In summary, the choice of tools should always be based on an assessment of specific requirements, available resources, and project objectives. The important thing is that open data is transformed into accessible knowledge, and there are multiple ways to achieve this goal. D3.js and Observable offer one of these paths, particularly suited to those looking to combine technical flexibility with principles of openness and transparency. If you know of any other tool or would like us to delve into another topic, please send it to us through our social networks or in the contact form.

Blog

Since its origins, the open data movement has focused mainly on promoting the openness of data and its reuse. The objective that has articulated most of the initiatives, both public and private, has been to overcome the obstacles to publishing increasingly complete data catalogues and to ensure that public sector information is available so that citizens, companies, researchers and the public sector itself can create economic and social value.

However, as we have taken steps towards an economy that is increasingly dependent on data and, more recently, on artificial intelligence – and in the near future on the possibilities that autonomous agents bring us through agentic artificial intelligence – priorities have been changing and the focus has been shifting towards issues such as improving the quality of published data.

It is no longer enough for datasets to be published in an open data portal in line with good practices, or even for the data to meet quality standards at the time of publication. The publication must also meet service levels that transform mere provision into an operational commitment, mitigating the uncertainties that often hinder reuse.

When a developer integrates a real-time transportation data API into their mobility app, or when a data scientist works on an AI model with historical climate data, they are taking a risk if they are uncertain about the conditions under which the data will be available. If at any given time the published data becomes unavailable because the format changes without warning, because the response time skyrockets, or for any other reason, the automated processes fail and the data supply chain breaks, causing cascading failures in all dependent systems.

In this context, the adoption of service level agreements (SLAs) could be the next step for open data portals: evolving from the usual "best effort" model to become critical, reliable and robust digital infrastructure.

What are an SLA and a data contract in the context of open data?

In the context of site reliability engineering (SRE), an SLA is a contract negotiated between a service provider and its customers in order to set the level of quality of the service provided. It is, therefore, a tool that helps both parties to reach a consensus on aspects such as response time, time availability or available documentation.

In an open data portal, where there is often no direct financial consideration, an SLA could help answer questions such as:

  • How long will the portal and its APIs be available?
  • What response times can we expect?
  • How often will the datasets be updated?
  • How are changes to metadata, links, and formatting handled?
  • How will incidents, changes and notifications to the community be managed?

In addition, in this transition towards greater operational maturity, the still-maturing concept of the data contract emerges. If the SLA is an agreement that defines service level expectations, the data contract is an implementation that formalizes this commitment. A data contract would not only specify the schema and format, but would act as a safeguard: if a system update attempts to introduce a change that breaks the promised structure or degrades the quality of the data, the data contract makes it possible to detect and block that anomaly before it affects end users.
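
To make the idea concrete, here is a minimal sketch in Python of how a data contract check might block a breaking schema change before publication. The contract, the field names and the sample rows are all invented for illustration; real implementations usually rely on dedicated schema or contract tooling:

```python
# A toy data contract: the promised schema for a published dataset.
# Field names and types are illustrative only.
CONTRACT = {"region": str, "year": int, "travellers": int}


def violates_contract(rows: list[dict]) -> list[str]:
    """Return a list of contract violations found in the incoming data."""
    problems = []
    for i, row in enumerate(rows):
        missing = set(CONTRACT) - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        for column, expected_type in CONTRACT.items():
            if column in row and not isinstance(row[column], expected_type):
                problems.append(f"row {i}: column '{column}' is not {expected_type.__name__}")
    return problems


ok_update = [{"region": "North", "year": 2024, "travellers": 1200}]
breaking_update = [{"region": "North", "year": "2024"}]  # missing column, wrong type

# A conforming update passes; a breaking one is flagged before release.
blocked = bool(violates_contract(breaking_update))
```

Run against a publication pipeline, a check like this is what turns the promise in an SLA into something enforceable: the update that renames a column simply never reaches the reuser.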

INSPIRE as a starting point: availability, performance and capacity

The European Union's Infrastructure for Spatial Information (INSPIRE) has established one of the world's most rigorous quality-of-service frameworks for geospatial data. Directive 2007/2/EC, known as INSPIRE, currently in its version 5.0, includes technical obligations that could serve as a reference for any modern data portal. In particular, Regulation (EC) No 976/2009 sets out criteria that could well serve as a standard for any strategy for publishing high-value data:

  • Availability: Infrastructure must be available 99% of the time during normal operating hours.
  • Performance: For a visualization service, the initial response should arrive in less than 3 seconds.
  • Capacity: For a location service, the minimum number of simultaneous requests served with guaranteed throughput must be 30 per second.
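
These criteria translate naturally into automatic checks. The sketch below (plain Python) evaluates a service against INSPIRE-style thresholds for availability and initial response time; the thresholds are the ones quoted above, but the monitoring samples themselves are invented:

```python
# Simulated monitoring samples: (service_was_up, response_seconds).
# None for response time means the probe got no answer at all.
samples = [(True, 1.2), (True, 2.4), (False, None), (True, 0.9), (True, 2.8)]

AVAILABILITY_TARGET = 0.99   # 99% of normal operating hours
RESPONSE_TARGET_S = 3.0      # initial response under 3 seconds

up = [s for s in samples if s[0]]
availability = len(up) / len(samples)
worst_response = max(t for _, t in up)

meets_availability = availability >= AVAILABILITY_TARGET
meets_response = worst_response < RESPONSE_TARGET_S
```

With these five probes the service would pass the response-time check but fail the availability target, which is exactly the kind of finding an SLA dashboard would surface to both the portal operator and the reuser community.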

To help comply with these service standards, the European Commission offers tools such as the INSPIRE Reference Validator. This tool helps not only to verify syntactic interoperability (that the XML or GML is well formed), but also to ensure that network services comply with the technical specifications that allow those SLAs to be measured.

At this point, the demanding SLAs of the European spatial data infrastructure make us wonder if we should not aim for the same for critical health, energy or mobility data or for any other high-value dataset.

What an SLA could cover on an open data platform

When we talk about open datasets in the broad sense, the availability of the portal is a necessary condition, but not sufficient. Many issues that affect the reuser community are not complete portal crashes, but more subtle errors such as broken links, datasets that are not updated as often as indicated, inconsistent formats between versions, incomplete metadata, or silent changes in API behavior or dataset column names.

Therefore, it would be advisable to complement the SLAs of the portal infrastructure with "data health" SLAs that can be based on  already established reference frameworks such as:

  • Quality models such as ISO/IEC 25012, which allows the quality of the data to be broken down into measurable dimensions such as accuracy (that the data represents reality), completeness (that necessary values are not missing) and consistency (that there are no contradictions between tables or formats) and convert them into measurable requirements.
  • FAIR Principles, which stands for Findable, Accessible, Interoperable, and Reusable. These principles emphasize that digital assets should not only be available, but should be traceable using persistent identifiers, accessible under clear protocols, interoperable through the use of standard vocabularies, and reusable thanks to clear licenses and documented provenance. The FAIR principles can be put into practice by systematically measuring the quality of the metadata that makes location, access and interoperability possible. For example, data.europa.eu's Metadata Quality Assurance (MQA) service helps you automatically evaluate catalog metadata, calculate metrics, and provide recommendations for improvement.
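By way of illustration, the ISO/IEC 25012 completeness dimension mentioned above can be reduced to a measurable requirement by counting how many mandatory metadata fields each catalog entry fills in. The field names and entries in this sketch are hypothetical:

```python
# Sketch: a simple metadata completeness metric (one ISO/IEC 25012 dimension),
# computed over hypothetical catalog entries with invented field names.
MANDATORY = ("title", "description", "license", "modified")

datasets = [
    {"title": "Rural tourism occupancy", "description": "Monthly survey data",
     "license": "CC-BY", "modified": "2026-01-28"},
    {"title": "Road accidents 2024", "description": "",
     "license": None, "modified": "2025-03-01"},
]

def completeness(entry):
    """Share of mandatory fields with a non-empty value."""
    filled = sum(1 for f in MANDATORY if entry.get(f))
    return filled / len(MANDATORY)

for d in datasets:
    print(d["title"], completeness(d))  # the second entry lacks description and license
```

A portal could aggregate such per-dataset scores into a catalog-wide quality indicator, in the spirit of the MQA service.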

To make these concepts operational, we can focus on four examples where establishing specific service commitments would provide a differential value:

  • Catalog compliance and currency: The SLA could ensure that the metadata is always aligned with the data it describes. A compliance commitment would ensure that the portal undergoes periodic validations (following specifications such as DCAT-AP-ES or HealthDCAT-AP) to prevent the documentation from becoming obsolete with respect to the actual resource.
  • Schema stability and versioning: One of the biggest enemies of automated reuse is the "silent change". If a column changes its name or a data type changes, the data ingestion flows will fail immediately. A service level commitment might include a versioning policy. This would mean that any changes that break compatibility would be announced with sufficient advance notice and, preferably, that the previous version would be kept in parallel for a reasonable amount of time.
  • Freshness and refresh frequency: It's not uncommon to find datasets labeled as daily but last actually modified months ago. A good practice could be the definition of publication latency indicators. A possible SLA would establish the value of the average time between updates and would have alert systems that would automatically notify if a piece of data has not been refreshed according to the frequency declared in its metadata.
  • Success rate: In the world of data APIs, it's not enough to just receive an HTTP 200 (OK) code to determine if the answer is valid. If the response is, for example, a JSON with no content, the service is not useful. The service level would have to measure the rate of successful responses with valid content, ensuring that the endpoint not only responds, but delivers the expected information.
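The freshness commitment described above lends itself to simple automation: compare the update frequency declared in the metadata with the actual last modification date. A minimal sketch, in which the frequency labels and tolerances are hypothetical rather than any standard vocabulary:

```python
from datetime import date, timedelta

# Sketch: flag datasets whose declared update frequency has not been honored.
# The frequency labels and tolerance windows below are invented examples.
MAX_AGE = {
    "daily": timedelta(days=2),
    "monthly": timedelta(days=45),
    "annual": timedelta(days=400),
}

def is_stale(declared_freq, last_modified, today):
    """True if the dataset is older than the tolerance for its declared frequency."""
    return (today - last_modified) > MAX_AGE[declared_freq]

today = date(2026, 1, 28)
print(is_stale("daily", date(2026, 1, 27), today))   # False: updated yesterday
print(is_stale("daily", date(2025, 10, 1), today))   # True: labeled daily, months old
```

An alerting system would simply run such a check over the whole catalog and notify publishers of any dataset flagged as stale.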

A first step with SLI, SLO and SLA: measure before committing

Since establishing these types of commitments is really complex, a possible strategy to take action gradually is to adopt a pragmatic approach based on industry best practices. For example, in reliability engineering, a hierarchy of three concepts is proposed that helps avoid unrealistic compromises:

  • Service Level Indicator (SLI): it is the measurable and quantitative indicator. It represents the technical reality at a given moment. Examples of SLI in open data could be the "percentage of successful API requests", "p95 latency" (the response time of 95% of requests) or the "percentage of download links that do not return error".
  • Service Level Objective (SLO): this is the internal objective set for this indicator. For example: "we want 99.5% of downloads to work correctly" or "p95 latency must be less than 800ms". It is the goal that guides the work of the technical team.
  • Service Level Agreement (SLA): is the public and formal commitment to those objectives. This is the promise that the data portal makes to its community of reusers and that includes, ideally, the communication channels and the protocols for action in the event of non-compliance.
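For example, the two SLIs mentioned above (success rate and p95 latency) can be computed from request logs and then tested against the SLO. The log entries in this sketch are invented:

```python
# Sketch: deriving SLIs from (invented) request logs and checking them against SLOs.
requests_log = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 340},
    {"status": 500, "latency_ms": 2100},
    {"status": 200, "latency_ms": 95},
    {"status": 200, "latency_ms": 760},
]

def success_rate(log):
    """SLI: fraction of requests that did not return an error status."""
    return sum(r["status"] < 400 for r in log) / len(log)

def p95_latency(log):
    """SLI: p95 latency via nearest-rank on the sorted latencies."""
    ordered = sorted(r["latency_ms"] for r in log)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

# SLOs: 99.5% success and p95 below 800 ms (example targets from the text)
slo_met = success_rate(requests_log) >= 0.995 and p95_latency(requests_log) < 800
print(success_rate(requests_log), p95_latency(requests_log), slo_met)
```

Only once indicators like these are measured continuously and the SLOs are met in a stable way would it make sense to publish them as an SLA.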

Infographic titled “What are SLI, SLO, and SLA for?”. On the left, three overlapping circles contain icons: a ruler (measurement), a target (goals), and a handshake (agreement). On the right, three numbered explanations appear:  SLI – Measure the service: used to understand how a system or data portal is actually performing; without measurement, you cannot know whether things are going well or poorly.  SLO – Set objectives: provides clear goals for the team and helps prioritize efforts to improve the service where it is most needed.  SLA – Build trust: tells users what they can expect from the service and represents a public commitment to quality. At the bottom, the datos.gob.es logo appears with the phrase “boosting our digital economy” and the note “Source: own elaboration – datos.gob.es”.

Figure 1. Visual to explain the difference between SLI, SLO and SLA. Source: own elaboration - datos.gob.es.

This distinction is especially valuable in the open data ecosystem due to the hybrid nature of a service in which not only an infrastructure is operated, but the data lifecycle is managed.

In many cases, the first step might not be so much to publish an ambitious SLA right away as to start by defining SLIs and setting internal SLOs. Once measurement is automated and service levels are stable and predictable, it would be time to turn them into a public commitment (SLA).

Ultimately, implementing service tiers in open data could have a multiplier effect. Not only would it reduce technical friction for developers and improve the reuse rate, but it would make it easier to integrate public data into AI systems and autonomous agents. New uses such as the evaluation of generative Artificial Intelligence systems, the generation and validation of synthetic datasets or even the improvement of the quality of open data itself would benefit greatly.

Establishing a data SLA would, above all, be a powerful message: it would mean that the public sector not only publishes data as an administrative act, but operates it as a digital service that is highly available, reliable, predictable and, ultimately, prepared for the challenges of the data economy.

Content created by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalisation. The content and views expressed in this publication are the sole responsibility of the author.

calendar icon
Documentation

Introduction

Every year there are tens of thousands of accidents in Spain, in which thousands of people are injured to varying degrees, and which occur in very different circumstances, both in terms of the type of road and the type of accident.

Many of the statistics related to these parameters are collected in the databases of the Directorate General of Traffic (DGT) and some of them in the catalogue hosted in datos.gob.es.

In this exercise, we will examine the content of the DGT accident database for the year 2024 in order to make a series of basic visualizations that allow us to see quickly and intuitively which facts stand out regarding the incidence of accidents and their consequences in that year.

To do this, we are going to develop Python code that allows us to read and calculate basic metrics regarding the total number of victims, the particularities of the infrastructures as well as the different cases of accidents. And once we have this data available, we will visualize it using the Javascript D3.js library, which allows us both to represent data in its most traditional form and in more contemporary designs, common in the press, thus favoring a narrative that is fluid in style and coherent in content.

In the Python environment we will use common libraries such as NumPy, for basic calculations (sums, maximums and minimums), and Pandas, to structure the data intuitively, facilitating both its organization and its transformation. We will also work with Datetime, both to format the input data into standard Python date types and to aggregate the data easily and intuitively. In this way we will learn how to open any data file in .CSV format, structure it in an orderly way and carry out basic transformations and operations in a simple way.
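The reading and date-formatting steps just described might look like this minimal pandas sketch; the column names are invented for illustration and are not the actual DGT schema:

```python
import io

import pandas as pd

# Sketch with invented columns (not the real DGT schema): read a
# semicolon-separated CSV and convert its date field into a datetime type.
raw = io.StringIO(
    "FECHA;TIPO_ACCIDENTE;TOTAL_VICTIMAS\n"
    "15/03/2024;2;3\n"
    "16/03/2024;1;1\n"
)
df = pd.read_csv(raw, sep=";")
df["FECHA"] = pd.to_datetime(df["FECHA"], format="%d/%m/%Y")

print(df.dtypes)                   # FECHA is now a datetime column
print(df["TOTAL_VICTIMAS"].sum())  # 4
```

In the real exercise, `io.StringIO` would be replaced by the path to the downloaded DGT file.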

In the Javascript environment we will develop notebooks in D3.js thanks to the use of Observable, an open and free initiative that allows Javascript code to be executed directly in a web interface, without having to resort to local servers or complex installations. In different notebooks we will create classic visualizations, such as time series on Cartesian axes or maps, along with other proposals such as bubble distributions or elements stacked by categories.

In Figure 1 you can see the main stages of this exercise, from the reading of the data within the DGT file, to the operations and output variables in JSON format, which will in turn serve us in a Javascript environment to be able to develop the visualizations in D3.js.

Steps to be followed when performing this exercise, from reading the input CSV file, postprocessing the data with Python, creating an output in JSON format and ultimately displaying the information in D3.js

Figure 1. Steps to be followed when performing this exercise, from reading the input CSV file, postprocessing the data with Python, creating an output in JSON format and ultimately displaying the information in D3.js

Access to the Github repositories, GoogleColab notebook and Observable notebooks is done via:

Access to the Github repository

Access to GoogleColab notebook

Access to Observable notebooks

Development Process 

1. Reading the data file 

The first step will be to read the DGT file containing all the accident records for the year 2024. This step will allow us to identify the fields of interest and especially in what format they are. We will be able to identify if any transformation is required, especially in the information of the date, as it is structured in the original file.

We will also see how to translate the codes of many of the categories offered by the DGT, so that we can make a real interpretation beyond the numbers of categories such as type of accident, type of road or ownership of the road.

Once we understand the structure and content of the data, we can start operating with it.

2. Calculating Metrics

The Pandas Python library allows us to operate with the different columns of data and perform basic calculations that will be representative enough to minimally understand the casuistry of accidents on Spanish roads.

In this section, three types of calculations will be made.

  • The first of these will be the calculation of the total number of victims per hour of the day for each of the days of the week. The DGT database is structured by day of the week, so we will also use this time scale to represent the data in a series. It should be noted that a victim is considered to be any person who has died or who is diagnosed as seriously or lightly injured.
  • The second calculation will be the sum of the total of accidents for different categories, such as road ownership, type of accident or type of road. This will allow us to see which are the conditions in which accidents are most frequent.
  • The third calculation will be the number of accidents per municipality. In this case we will restrict the calculation to the province of Valencia as an example, although it would be applicable to any province or municipality of interest. Here we will observe the differences between urban and non-urban centers, as well as those municipalities through which the main communication routes pass.
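The first of the calculations above (victims per hour and day of the week) reduces to a pandas groupby. A hedged sketch on toy data with hypothetical column names:

```python
import pandas as pd

# Sketch: total victims per (day of week, hour), on hypothetical toy data.
# Column names are invented examples, not the real DGT field names.
df = pd.DataFrame({
    "dia_semana": ["lunes", "lunes", "martes", "lunes"],
    "hora": [8, 8, 17, 17],
    "victimas": [2, 1, 3, 1],
})

por_dia_hora = df.groupby(["dia_semana", "hora"])["victimas"].sum()
print(por_dia_hora)
# lunes/8 -> 3, lunes/17 -> 1, martes/17 -> 3
```

The same `groupby` pattern covers the second calculation: grouping by road ownership, accident type or road type instead of day and hour.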

3. Visualization Design 

Once we have calculated the metrics of interest, we will develop five visualization exercises in D3.js. To do this, we will export the result of the metrics in JSON format and create notebooks in Observable. Specifically, we will make the following visualizations:

  • Time series with the total number of casualties in each hour and day of the week, with an interactive drop-down menu to select the day of the week of interest. In addition to the curve that describes the number of victims, we will draw the uncertainty of all the days of the week on the background of the graph, so that the daily time series is framed in the context of the whole week as a reference.
  • Map of the province of Valencia with the total number of accidents by municipality.
  • Bubble diagram, with the different magnitudes of the different types of accidents with the total number of accidents in each case written in detail.
  • Stacked dot diagram, where we accumulate circles or any other geometric shape for the different road ownership and its total number of accidents within the framework of each ownership.
  • Mountain ridge diagram, where the height of each mountain represents the total number of victims on a logarithmic scale. 
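Before any of these visualizations can be built, the computed metrics have to be exported as JSON for D3.js to consume. A minimal sketch, with invented category names and totals:

```python
import json

import pandas as pd

# Sketch: serialize an aggregated metric as the list-of-records JSON that
# D3.js data joins typically consume. Values below are invented examples.
totals = pd.DataFrame({
    "tipo_accidente": ["colision", "salida de via", "atropello"],
    "total": [120, 80, 45],
})

records = totals.to_dict(orient="records")
with open("metricas.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

print(records[0])  # {'tipo_accidente': 'colision', 'total': 120}
```

On the Observable side, a call such as `d3.json("metricas.json")` would load this array of objects directly into the notebook.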

Viewing metrics  

The result of this exercise can be seen graphically and explicitly in the form of visualizations made for the web format and accessible from a web interface, both for its development and for its subsequent publication. These visualizations are gathered as Observable notebooks here:

Access to Observable notebooks

In Figure 2 we have the result of the time series of the total number of victims with respect to the time of day for different days of the week. The time series is framed within the uncertainty of the total number of days of the week, to give an idea of the margin of variability that we can have depending on the time of day.

Time series of total accident casualties by time of day for all days of the week in 2024. The light blue background indicates the uncertainty associated with all the days of the week as context, with a drop-down menu to select the day of the week.

Figure 2. Time series of total accident casualties by time of day for all days of the week in 2024. The light blue background indicates the uncertainty associated with all the days of the week as context, with a drop-down menu to select the day of the week.

In Figure 3 we can see the map of the province of Valencia with a colour intensity proportional to the number of accidents in each municipality. Municipalities with no recorded accidents appear in white. Intuitively you can make out the layout of the main roads that cross the province, both the road to the east of the city of Valencia in the direction of Madrid and the inland road to the south of the city in the direction of Alicante.

Map of the number of accidents by municipality in the province of Valencia in 2024.

Figure 3. Map of the number of accidents by municipality in the province of Valencia in 2024.

In Figure 4 we see a geometric shape, the circle, associated with the types of accidents, with the number of accidents in each category shown in detail. In this type of visualization, the most frequent accident types naturally emerge around the center of the diagram, while minority or residual ones occupy the perimeter, giving the set of shapes a round overall form.

Bubble diagram of the number of accidents by accident type in 2024.

Figure 4. Bubble diagram of the number of accidents by accident type in 2024. 

Figure 5 shows the traditional bar diagram, but this time broken down into smaller units, to refine the number of accidents associated with the ownership of the road where they have occurred. This type of diagram allows us to discern small differences between similar quantities, preserving the general message that we obtain from a calculation of these characteristics.  

Bar diagram with dot discretization for the number of accidents by road ownership in 2024

Figure 5. Bar diagram with dot discretization for the number of accidents by road ownership in 2024 

Figure 6 shows the total number of victims on a logarithmic scale based on the height of each mountain for each type of road. 

Mountain ridge diagram, displaying the total number of victims by each type of road in 2024.

Figure 6. Mountain ridge diagram, displaying the total number of victims by each type of road in 2024. 

Lessons learned 

Through these steps we will acquire a series of transversal skills for working with datasets presented in columnar CSV format, a very popular format, covering both their analysis and their visualization. These lessons are specifically:

  • Universality of reading and structuring data: the use of tools such as Python, with its Numpy and Pandas libraries, allows access to data in detail and structured in an orderly and intuitive way with a few lines of code.
  • Simple calculations in Pandas: the Python library itself allows simple but essential calculations for the preliminary interpretation of results.
  • Datetime format: through this Python library we can become familiar with the standard date format, and thus perform all kinds of transformations, filters and selections that interest us the most in any time interval.
  • JSON format: once we decide to give space to our visualizations on the web, learning the structure and use of the JSON format is very useful given its wide use in all types of applications and web architectures.
  • Spectrum of D3.js possibilities: this Javascript library allows us to explore from the most traditional and conservative designs to the most creative ones, thanks to its principles based on the most basic shapes, without templates or predefined diagrams.

Conclusions and next steps 

We have learned to read and structure data according to the standards of the most widely used formats in the world of analysis and visualization. This exercise also serves as an introductory module to the world of D3.js, a very versatile, current and popular tool within the world of storytelling and data visualization at all levels.

In order to move forward in this exercise, it is recommended:

  • For analysts and developers, it is possible to dispense with the Pandas library and structure the data with more elementary Python objects such as arrays and matrices, looking for which functions and which operators allow the same tasks that Pandas does to be performed but in a more fundamental way, especially if we think of production environments for which we need the fewest possible libraries to lighten the application.
  • For the creators of visualizations, information on municipalities can also be projected onto existing cartographic databases such as OpenStreetMap and thus link the incidence of accidents to orographic features or infrastructures already reflected in these cartographic databases. For the magnitudes of the accident numbers, you can explore Treemap diagrams or Voronoi diagrams and see if they convey the same message as the ones presented in this exercise.  

Areas of application 

The steps described in this exercise can become part of any regularly used toolbox for the following profiles:

  • Data analysts: here are the basic steps for the description of a data file in CSV format and the basic calculations to be carried out both in the date field and operations between variables of different columns. These tools can be used to introduce you to the world of data analysis and help in those first steps when facing a dataset.
  • Scientists and research staff: the universality of the tools described here apply to a wide variety of data sources, such as that experienced in experimental sciences and observations or measurements of all kinds. These tools allow for a quick and rigorous analysis regardless of the field of knowledge in which you work.
  • Web developers: the export of data in JSON format as well as the Javascript code offered in Observable notebooks are easily integrated into all types of environments (Svelte, React, Angular, Vue) and allow the creation of visualizations on a website in a simple and intuitive way.
  • Journalists: covering the entire life cycle of a data file, from reading to visualization, gives journalists and researchers independence to evaluate and interpret the data by themselves, without depending on external technical resources. The creation of the map by municipality opens the door to using any other similar data, such as electoral processes, with the same output format to show geographical variability with respect to any type of magnitude.
  • Graphic Designers: Handling visualization tools with a wide degree of freedom allows designers to cultivate all their creativity within the rigor and accuracy that data requires.
calendar icon
Blog

The European Commission has recently presented the document setting out a new EU Strategy in the field of data. Among other ambitious objectives, this initiative aims to address a transcendental challenge in the era of generative artificial intelligence: the insufficient availability of data under the right conditions.

Since the previous 2020 Strategy, we have witnessed an important regulatory advance that aimed to go beyond the 2019 regulation on open data and reuse of public sector information. 

Specifically, on the one hand, the Data Governance Act served to promote a series of measures that tended to facilitate the use of data generated by the public sector in those cases where other legal rights and interests were affected – personal data, intellectual property. 

On the other hand, through the Data Act, progress was made, above all, in the line of promoting access to data held by private subjects, taking into account the singularities of the digital environment.

The necessary change of focus in the regulation on access to data. 

Despite this significant regulatory effort, the European Commission has detected an underuse of data, which is also often fragmented in terms of the conditions of its accessibility. This is due, in large part, to the existence of significant regulatory diversity. Measures are therefore needed to simplify and streamline the European regulatory framework on data.

Specifically, it has been found that regulatory fragmentation generates legal uncertainty and disproportionate compliance costs due to the complexity of the applicable regulatory framework itself. In particular, the overlap between the General Data Protection Regulation (GDPR), the Data Governance Act, the Data Act, the Open Data Directive and, likewise, sector-specific rules in certain areas has generated a complex regulatory framework that is difficult to navigate, especially for small and medium-sized companies. Each of these standards was designed to address specific challenges that were tackled successively, so a more coherent overview is needed to resolve potential inconsistencies and, ultimately, facilitate their practical implementation.

In this regard, the Strategy proposes to promote a new legislative instrument – the proposal for a Regulation called Digital Omnibus – which aims to consolidate the rules relating to the European single market in the field of data into a single standard. Specifically, with this initiative:

  • The provisions of the Data Governance Act are merged into the regulation of the Data Act, thus eliminating duplications.
  • The Regulation on non-personal data, whose functions are also covered by the Data Act, is repealed;
  • Public sector data standards are integrated into the Data Act, as they were previously included in both the 2019 Directive and the Data Governance Act. 

This regulation therefore consolidates the role of the Data Act as a general reference standard in the field. It also strengthens the clarity and precision of its forecasts, with the aim of facilitating its role as the main regulatory instrument through which it is intended to promote the accessibility of data in the European digital market.

Modifications in terms of personal data protection

The Digital Omnibus proposal also includes important changes with regard to the rules on the protection of personal data, amending several provisions of Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data.

In order for personal data to be used – that is, any information referring to an identified or identifiable natural person – it is necessary that one of the circumstances referred to in Article 6 of the aforementioned Regulation is present, including the consent of the owner or the existence of a legitimate interest on the part of the person who is going to process the data.

Legitimate interest allows personal data to be processed when it is necessary for a valid purpose (improving a service, preventing fraud, etc.) and does not adversely affect the rights of the individual.

Source: Guide on legitimate interest. ISMS Forum and Data Privacy Institute. Available here: guiaintereslegitimo1637794373.pdf

Regarding the possibility of resorting to legitimate interest as a legal basis for training artificial intelligence tools, the current regulation allows the processing of personal data as long as the rights of the interested parties who own such data do not prevail.

However, given the generality of the concept of "legitimate interest", there will not always be absolute certainty when deciding whether personal data may be used under this clause; it will be necessary to analyse case by case. Specifically, the conflicting legal interests will have to be weighed and, therefore, its application may give rise to reasonable doubts in many cases.

Although the European Data Protection Board has tried to establish some guidelines to specify the application of legitimate interest, the truth is that the use of open and indeterminate legal concepts will not always allow clear and definitive answers. To facilitate the specification of this concept in each case, the Strategy points, as a criterion to take into account, to the potential benefit that the processing may bring to the data subject and to society in general. Likewise, given that the consent of the owner of the data will not be necessary (and therefore its revocation would not be applicable), it reinforces the owner's right to object to the processing of their data and, above all, guarantees greater transparency regarding the conditions under which the data will be processed. Thus, by strengthening the legal position of the data subject and referring to this potential benefit, the Strategy aims to facilitate the use of legitimate interest as a legal basis for the use of personal data without the consent of the data subject, but with appropriate safeguards.

Another major data protection measure concerns the distinction between anonymised and pseudonymised data. The GDPR defines pseudonymisation as processing carried out in such a way that the data can no longer be attributed to a specific data subject without the use of additional information kept separately. However, pseudonymised data is still personal data and, therefore, subject to this regulation. Anonymous data, on the other hand, does not relate to identified or identifiable persons, and its use would therefore not be subject to the GDPR. Consequently, in order to know whether we are dealing with anonymous or pseudonymised data, it is essential to determine whether there is a "reasonable probability" of identifying the owner of the data.

However, the technologies currently available multiply the risk of re-identification of the data subject, which directly affects what could be considered reasonable, generating uncertainty that has a negative impact on technological innovation. For this reason, the Digital Omnibus proposal, along the lines already stated by the Court of Justice of the European Union, aims to establish the conditions under which pseudonymised data could no longer be considered personal data, thus facilitating its use. To this end, it empowers the European Commission, through implementing acts, to specify such circumstances, in particular taking into account the state of the art and, likewise, offering criteria that allow the risk of re-identification to be assessed in each specific case. 

Expanding the catalogue of high-value datasets

The Strategy also aims to expand the catalogue of High Value Data (HVD) provided for in Implementing Regulation (EU) 2023/138. These are datasets with exceptional potential to generate social, economic and environmental benefits, as they are high-quality, structured and reliable data that are accessible under technical, organisational and semantic conditions that are very favourable for automated processing. Six categories are currently included (geospatial, Earth observation and environment, meteorology, statistics, business and mobility), to which the Commission would add, among other things, legal, judicial and administrative data.

Opportunity and challenge

The European Data Strategy represents a paradigmatic shift that is certainly relevant: it is not only a matter of promoting regulatory frameworks that facilitate the accessibility of data at a theoretical level but, above all, of making them work in their practical application, thus promoting the necessary conditions of legal certainty that allow a competitive and innovative data economy to be energized.

To this end, it is essential, on the one hand, to assess the real impact of the measures proposed through the Digital Omnibus and, on the other, to offer small and medium-sized enterprises appropriate legal instruments – practical guides, suitable advisory services, standard contractual clauses, etc. – to face the challenge that regulatory compliance poses for them in a context of enormous complexity. Precisely, this difficulty requires, on the part of the supervisory authorities and, in general, of public entities, to adopt advanced and flexible data governance models that adapt to the singularities posed by artificial intelligence, without affecting legal guarantees.

Content prepared by Julián Valero, professor at the University of Murcia and coordinator of the Innovation, Law, and Technology Research Group (iDerTec). The content and views expressed in this publication are the sole responsibility of the author.

calendar icon
Application

Web viewer that displays the fiber deployments of all PEBA and UNICO programs on a single map, based on publicly available data. Each area has the background color of the awarded operator, and the border is a different color for each program. In the case of the 2013-2019 PEBA plan, as deployments are assigned to individual population entities, a marker is shown with the location obtained from the CNIG. In addition, when the map is not zoomed in, a heat map is displayed showing the distribution of deployments by area.

This visualization avoids having to compare different map viewers if what we are interested in is seeing which operators reach which areas or simply having an overview of which deployments are pending in my area. It also allows us to consult aspects such as the updated completion date, which were previously only available in the different Excel files for each program. I also think it could be useful for analyzing how the areas are distributed among the different programs (for example, if an area covered in UNICO 2021 then has nearby areas in UNICO 2022 covered by another operator, for example), or even possible overlaps (for example, due to areas that were not executed in previous programs).

calendar icon
News

Did you know that Spain created the first state agency specifically dedicated to the supervision of artificial intelligence (AI) in 2023? Even anticipating the European Regulation in this area, the Spanish Agency for the Supervision of Artificial Intelligence (AESIA) was born with the aim of guaranteeing the ethical and safe use of AI, promoting responsible technological development.

Among its main functions is to ensure that both public and private entities comply with current regulations. To this end, it promotes good practices and advises on compliance with the European regulatory framework, which is why it has recently published a series of guides to ensure the consistent application of the European AI regulation.

In this post we will delve into what the AESIA is and we will learn relevant details of the content of the guides.

What is AESIA and why is it key to the data ecosystem?

The AESIA was created within the framework of Axis 3 of the Spanish AI Strategy. Its creation responds to the need for an independent authority that not only supervises, but also guides, the deployment of algorithmic systems in our society.

Unlike other purely sanctioning bodies, the AESIA is designed as a "Think & Do" organisation, i.e. one that both investigates problems and proposes solutions. Its practical usefulness rests on three aspects:

  1. Legal certainty: it provides clear frameworks so that businesses, especially SMEs, know what rules apply when they innovate.
  2. International benchmark: it acts as the Spanish interlocutor before the European Commission, ensuring that the voice of our technological ecosystem is heard in the development of European standards.
  3. Citizen trust: it ensures that AI systems used in public services or critical areas respect fundamental rights, avoiding bias and promoting transparency.

At datos.gob.es, we have always defended that the value of data lies in its quality and accessibility. The AESIA complements this vision by ensuring that, once data is transformed into AI models, its use is responsible. As such, these guides are a natural extension of our regular resources on data governance and openness.

Resources for the use of AI: guides and checklists

The AESIA has recently published materials to support the implementation and compliance with the European Artificial Intelligence regulations and their applicable obligations. Although they are not binding and do not replace or develop existing regulations, they provide practical recommendations aligned with regulatory requirements pending the adoption of harmonised implementing rules for all Member States.

They are the direct result of the Spanish AI Regulatory Sandbox pilot. This sandbox allowed developers and authorities to collaborate in a controlled space to understand how to apply European regulations in real-world use cases.

It is essential to note that these documents are published without prejudice to the technical guides that the European Commission is preparing. Indeed, Spain is serving as a "laboratory" for Europe: the lessons learned here will provide a solid basis for the Commission's working group, ensuring consistent application of the regulation in all Member States.

The guides are designed to be a complete roadmap, from the conception of the system to its monitoring once it is on the market.

Figure 1. AESIA guidelines for regulatory compliance. Source: Spanish Agency for the Supervision of Artificial Intelligence

  • 01. Introduction to the AI Regulation: provides an overview of obligations, implementation deadlines and roles (providers, deployers, etc.). It is the essential starting point for any organization that develops or deploys AI systems.
  • 02. Practice and examples: grounds legal concepts in everyday use cases (e.g., is my personnel selection system a high-risk AI?). It includes decision trees and a glossary of key terms from Article 3 of the Regulation, helping to determine whether a specific system is regulated, what level of risk it has, and what obligations apply.
  • 03. Conformity assessment: explains the technical steps necessary to obtain the "seal" that allows a high-risk AI system to be marketed, detailing the two possible procedures under Annexes VI and VII of the Regulation: assessment based on internal control, or assessment with the intervention of a notified body.
  • 04. Quality management system: defines how organizations must structure their internal processes to maintain constant standards. It covers the regulatory compliance strategy, design techniques and procedures, examination and validation systems, among others.
  • 05. Risk management: a manual on how to identify, evaluate and mitigate possible negative impacts of the system throughout its life cycle.
  • 06. Human oversight: details the mechanisms to keep AI decisions always subject to human supervision, avoiding the technological "black box". It establishes principles such as understanding the system's capabilities and limitations, interpreting its results, and the authority not to use the system or to override its decisions.
  • 07. Data and data governance: addresses the practices needed to train, validate, and test AI models ensuring that datasets are relevant, representative, accurate, and complete. It covers data management processes (design, collection, analysis, labeling, storage, etc.), bias detection and mitigation, compliance with the General Data Protection Regulation, data lineage, and design hypothesis documentation, being of particular interest to the open data community and data scientists.
  • 08. Transparency: establishes how to inform the user that they are interacting with an AI and how to explain the reasoning behind an algorithmic result.
  • 09. Accuracy: defines appropriate metrics based on the type of system to ensure that the AI model meets its goal.
  • 10. Robustness: Provides technical guidance on how to ensure AI systems operate reliably and consistently under varying conditions.
  • 11. Cybersecurity: instructs on protection against threats specific to the field of AI.
  • 12. Logs: defines the measures to comply with the obligations of automatic registration of events.
  • 13. Post-market surveillance: documents the processes for executing the monitoring plan, documentation and analysis of data on the performance of the system throughout its useful life.
  • 14. Incident management: describes the procedure for reporting serious incidents to the competent authorities.
  • 15. Technical documentation: establishes the complete structure that the technical documentation must include (development process, training/validation/test data, applied risk management, performance and metrics, human supervision, etc.).
  • 16. Requirements Guides Checklist Manual: explains how to use the 13 self-diagnosis checklists that allow organizations to assess compliance, identify gaps, design adaptation plans and prioritize improvement actions.

All guides are available here and have a modular structure that accommodates different levels of knowledge and business needs.
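Some of these requirements lend themselves to mechanical checks. For instance, Guide 07 asks providers to ensure that training datasets are representative. The sketch below is a crude, hypothetical illustration of such a check (it is not AESIA's method, and the threshold and data are invented): it flags values of a demographic attribute whose share of the dataset falls below a minimum.

```python
from collections import Counter

def underrepresented_groups(records, attribute, min_share=0.10):
    """Flag values of `attribute` whose share of the dataset falls
    below `min_share` — a simple proxy for the representativeness
    checks that data-governance guidance asks providers to perform."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return sorted(g for g, n in counts.items() if n / total < min_share)

# Hypothetical training sample with a skewed age distribution
sample = (
    [{"age_group": "18-35"}] * 80
    + [{"age_group": "36-65"}] * 17
    + [{"age_group": "65+"}] * 3
)
print(underrepresented_groups(sample, "age_group"))  # → ['65+']
```

A real audit would of course combine several such signals (coverage, label quality, proxy variables) rather than a single share threshold.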

The self-diagnostic tool and its advantages

In parallel, the AESIA publishes material that facilitates the translation of abstract requirements into concrete and verifiable questions, providing a practical tool for the continuous assessment of the degree of compliance.

These are checklists that allow an entity to assess its level of compliance autonomously.

The use of these checklists provides multiple benefits to organizations. First, they facilitate the early identification of compliance gaps, allowing corrective action to be taken before the system is marketed or put into service. They also promote a systematic and structured approach to regulatory compliance: by mirroring the structure of the Regulation, they ensure that no essential requirement is left unassessed.

On the other hand, they facilitate communication between technical, legal and management teams, providing a common language and a shared reference to discuss regulatory compliance. And finally, checklists  serve as a documentary basis for demonstrating due diligence to supervisory authorities.
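The mechanics of such a self-assessment are simple to picture. The sketch below is a minimal, hypothetical illustration (the requirement names are invented, not taken from the actual AESIA checklists): it turns a yes/no checklist into a compliance rate plus the list of open gaps.

```python
def compliance_report(checklist):
    """Summarise a yes/no checklist into a compliance percentage and
    the list of open gaps, in the spirit of a self-diagnosis tool.
    Checklist items here are invented for illustration."""
    gaps = [item for item, done in checklist if not done]
    rate = 100 * (len(checklist) - len(gaps)) / len(checklist)
    return {"compliance_pct": round(rate, 1), "gaps": gaps}

checklist = [
    ("Technical documentation drafted", True),
    ("Event logging enabled", True),
    ("Human oversight procedure defined", False),
    ("Post-market monitoring plan in place", False),
]
report = compliance_report(checklist)
print(report["compliance_pct"])  # → 50.0
print(report["gaps"])
```

The output of a report like this is exactly the "documentary basis" mentioned above: a dated snapshot of which requirements are met and which adaptation actions remain.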

We must understand that these documents are not static: they are subject to an ongoing process of evaluation and review. In this regard, the AESIA continues to develop its operational capacity and expand its compliance support tools.

From the open data platform of the Government of Spain, we invite you to explore these resources. AI development must go hand in hand with well-governed data and ethical oversight.
