European Data Portal shares report with some standards for homogenising high-value data

Share

Fecha del documento: 23-01-2024

In order to comply with Directive (EU) 2019/1024 and its subsequent implementing regulation, EU member states are working on making available so-called high-value datasets (HVD). The aim is to enable citizens and businesses to access such data under technical requirements that favour its re-use and its positive impact on society, the economy and the environment.

Opening up these datasets is a major challenge for public administrations in all EU countries. Although much of this data is already available tousers, countries need to identify it in order to be able to report on it and resolve the high heterogeneity in formats, structures and semantics. In particular, from February 2025, Member States will have to report to the Commission every two years on available high-value datasets, including links to licence conditions and APIs.

To assist in this task, the European Data Portal has published the report "Report on Data Homogenisation for High-value Datasets" where it proposes a methodological approach to facilitate the identification and homogenisation of HVD. Among other issues, the report provides examples of standards that help to achieve greater interoperability not only between data, but also between the applications that use them.

A method for identification and homogenisation

The report describes a methodological approach based on three steps:

The identification of HVDs in existing data portals. Although there are some guidelines for HVD publication, like these for applying DCAT-AP, the naming of already published datasets is not uniform, which makes it difficult to find them. The report proposes a protocol that consists of defining keywords, based on the datasets and their associated attributes, contained in Annex I of the Implementing Regulation. The idea is to use these keywords to search the various existing data portals. The report explains how the identification protocol has been tested with datasets from the categories of business registers, statistical data and transport network data, including tables with the keywords used.
Localisation or development of data models, ontologies, controlled vocabularies and/or common APIs. In this section, the report describes some useful resources, which are summarised in the following table:

Resource	DESCRIPTION	Category of data in which they can help the most, according to the report
Inspire Directive	Characteristics that spatial information and its metadata must have.	Geospatial data Earth Observation and environmental data. Meteorological data Data on transport networks.
Inspire Directive data specifications (data specifications)	Models, schemes and coding rules for different spatial data thematic areas.	Geospatial data Earth Observation and Environmental Data Meteorological data Data on transport networks.
Inspire network services (network services)	A set of common interfaces for web services that enable the discovery, visualisation, downloading and transformation of spatial data.	Geospatial data Earth Observation and Environmental Data Meteorological data Data on transport networks.
Technical guidelines for Inspire metadata (Inspire technical guidelines for metadata)	Technical guidelines for metadata, with the minimum elements to be included as defined in Commission Regulation 1205/2008 .	Geospatial data Earth Observation and Environmental Data Meteorological data Data on transport networks.
Geo-DCATAP	Extension of the DCAT application profile to describe geospatial datasets.	Geospatial data
Core Location Vocabulary	A simplified data model that includes the fundamental characteristics of a location, represented as an address or geographic name, or through geometry.	Geospatial data
General Multilingual Environmental Thesaurus (GEMET).	Controlled vocabulary specialised in environmental information. It has a section on concepts linked to the spatial data categories included in Inspire.	Geospatial data Earth Observation Data Data on transport networks.
Semantic Sensor Network	W3C recommendation for describing sensors and their observations.	Meteorological data
Quantity, unit, dimension and type (QUDT).	A set of ontologies defining basic classes, properties and constraints used to model physical quantities, units of measurement and their dimensions in various measurement systems.	Meteorological data
List of Eurostat statistical classifications	Statistical classifications maintained by Eurostat, available as Linked Open Data in XKOS, the SKOS extension for modelling statistical classifications. They are presented by classification family, categorised by statistical domain and sub-domains (e.g. NACE for economic activity, which we will describe below).	Statistical data
Eurostat standard code lists	Predefined and organised sets of elements presenting statistical concepts using unique codes	Statistical data
Statistical Data and Metadata eXchange (SDMX)	Global initiative to standardise and harmonise the exchange of statistical data and metadata. It provides technical standards (the SDMX information model), guidelines, an IT architecture, tools and a series of tutorials to assist users.	Statistical data
RDF Data Cube Vocabulary	Ontology for describing multidimensional data, such as statistics, which is based on the core of the SDMX 2.0 information model.	Statistical data
Core Business Vocabulary	Referred to by the regulation itself, it consists of a simplified data model that captures the fundamental characteristics of a legal entity, such as its legal name, activity or address.	Business registers
NACE Code	Codes for the classification of economic activities in the European Union. Its NACE 2 revision was published by the European Commission in October 2022	Business registers
Organisation ontology	W3C ontology to support the publication of linked data relating to organisational information, i.e. it provides a number of ways to represent the relationship between people and organisations, together with the internal information structure of an organisation.	Business registers
Global Legal Entity Identifier Foundation	Centralised database with information on legal entities participating in global financial markets. It assigns each entity a unique Legal Entity Identifier (LEI) code that is recognised worldwide.	Business registers
NST Taxonomy	Classification system for goods transported by road, rail, inland waterways and sea. It takes into account the economic activity associated with the origin of the goods.	Data on transport networks.
Table of authorities of "Transport service"	List of codes for different types of transport services provided by the EU Vocabularies section.	Data on transport networks.

Source: Report on Data Homogenisation for High-value Datasets

The report also mentions some models to be used in the field of smart cities, such as Smart Data Models and the Spanish Open Cities.

The application of such models. The last step is the actual harmonisation of the data. Once the models to be used have been selected, it is time to apply them. In this phase, the necessary conversion processes will be carried out to provide the data in the appropriate formats and with unified quality metadata. The way in which these transformations are applied will vary depending on the intended end result. For example, it may consist of transforming tabular data (comma-separated values or CSVs, Excel, relational databases, etc.) into other data sources that are also tabular but follow the structure provided in common data models. You can also go further and transform them into tree-based representations (such as JSON) or RDF according to the ontologies and controlled vocabularies you select.

Conclusions of the report

The report ends with a series of conclusions and recommendations. There are still challenges around the identification of HVDs and the implementation of the Implementing Regulation in all European countries, especially in raising awareness and disseminating information about their importance. In HVD categories where there are large data harmonisation initiatives, such as Inspire on geospatial data or Eurostat on statistical HVD, we can find a larger amount of data available in an interoperable and harmonised way. In contrast, in categories where there is no majority initiative, such as companies and company ownership, there is still some way to go to implement the regulation.

The recommendations set out in the European Data Portal report help to shape a roadmap for publishing high-value datasets in each of the categories defined by the European Commission. A challenge that administrations will have to address during 2024 and that will facilitate the re-use of public information.