Documentation

The following glossaries define various terms related to data and associated technologies.

1. Glossary of terms related to open data.

(You can download the accessible version here)

 

2. Glossary of terms related to new technologies and data

(You can download the accessible version here)
Blog

Open data plays a significant role in technological development for many reasons. For example, it is a fundamental component in informed decision-making, in process evaluation and even in driving technological innovation. Provided it is of high quality, up to date and ethically sound, data can be the key ingredient for the success of a project.

In order to fully exploit the benefits of open data in society, the European Union has launched several initiatives to promote the data economy: a single digital model that encourages data sharing while emphasizing data sovereignty and data governance, the ideal and necessary framework for open data.

In the data economy, as stated in current regulations, the privacy of individuals and the interoperability of data must be guaranteed, and the regulatory framework is responsible for ensuring compliance with this premise. An example of this is the amendment of Law 37/2007 on the reuse of public sector information to comply with Directive (EU) 2019/1024. This regulation is aligned with the European Union's Data Strategy, which envisages a single data market in which mutual, free and secure exchange between the public and private sectors is facilitated.

To achieve this goal, key issues must be addressed, such as preserving certain legal safeguards or agreeing on common metadata description characteristics that datasets must meet to facilitate cross-industry data access and use, i.e. using a common language to enable interoperability between dataset catalogs.

What are metadata standards?

A first step towards data interoperability and reuse is to develop mechanisms that enable a homogeneous description of data, one that is easily interpretable and processable by both humans and machines. To this end, different vocabularies have been created and, over time, agreed upon until they have become standards.

Standardized vocabularies offer semantics that serve as a basis for the publication of data sets and act as a "legend" to facilitate understanding of the data content. In the end, it can be said that these vocabularies provide a collection of metadata to describe the data being published; and since all users of that data have access to the metadata and understand its meaning, it is easier to interoperate and reuse the data.

W3C: DCAT and DCAT-AP Standards

At the international level, several organizations that create and maintain standards can be highlighted:

  • World Wide Web Consortium (W3C): developed the Data Catalog Vocabulary (DCAT), a description standard designed to facilitate interoperability between catalogs of datasets published on the web (a minimal example of a DCAT description is sketched after this list).
    • Subsequently, taking DCAT as a basis, DCAT-AP was developed: a specification for exchanging descriptions of data published in European data portals, which in turn has more specific extensions such as:
      • GeoDCAT-AP, which extends DCAT-AP for the publication of spatial data.
      • StatDCAT-AP, which extends DCAT-AP to describe datasets with statistical content.
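
As a rough illustration of what a DCAT description looks like in practice, the following sketch builds a minimal catalog entry with rdflib, a Python library for working with RDF. All URIs, titles and names used here are invented for the example and do not correspond to any real catalog.

# Minimal sketch of a DCAT description built with rdflib (pip install rdflib).
# All URIs, titles and names below are fictitious and used only for illustration.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

catalog = URIRef("http://example.org/catalog")
dataset = URIRef("http://example.org/dataset/air-quality")
distribution = URIRef("http://example.org/dataset/air-quality/csv")

# A catalog that lists one dataset
g.add((catalog, RDF.type, DCAT.Catalog))
g.add((catalog, DCAT.dataset, dataset))

# The dataset, described with common DCAT / Dublin Core properties
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Air quality measurements", lang="en")))
g.add((dataset, DCTERMS.description, Literal("Hourly air quality readings.", lang="en")))
g.add((dataset, DCTERMS.publisher, URIRef("http://example.org/organization/city-council")))

# A distribution: the downloadable file behind the dataset
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/files/air-quality.csv")))
g.add((distribution, DCAT.mediaType, Literal("text/csv")))
g.add((dataset, DCAT.distribution, distribution))

print(g.serialize(format="turtle"))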

ISO: International Organization for Standardization

In addition to the World Wide Web Consortium, there are other organizations devoted to standardization, such as the International Organization for Standardization (ISO).

  • Among many other types of standards, ISO has also defined standards for the metadata of data catalogs:
    • ISO 19115, for describing geographic information. As with DCAT, extensions and technical specifications have also been developed from ISO 19115, for example:
      • ISO 19115-2, for raster data and imagery.
      • ISO 19139, which provides an XML implementation of the vocabulary (a minimal parsing sketch follows this list).
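
As an orientative sketch of how such an ISO 19139 record can be processed, the snippet below extracts the title and abstract of a metadata file using Python's standard library; the file name is hypothetical and the element path, based on the usual gmd/gco structure, may need adjusting to the record at hand.

# Sketch: reading the title and abstract of an ISO 19139 (XML) metadata record
# with Python's standard library. The file name is hypothetical and the element
# path follows the usual gmd/gco structure, so adjust it to the actual record.
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

root = ET.parse("metadata_iso19139.xml").getroot()  # hypothetical file

title = root.find(
    "gmd:identificationInfo/gmd:MD_DataIdentification/"
    "gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString",
    NS,
)
abstract = root.find(
    "gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract/gco:CharacterString",
    NS,
)

print("Title:", title.text if title is not None else "not found")
print("Abstract:", abstract.text if abstract is not None else "not found")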

The horizon in metadata standards: challenges and opportunities

 
Both W3C and ISO are working on the development and maintenance of standardized vocabularies adapted to the needs of users. Their work contributes to achieving an interoperable open data ecosystem that facilitates reuse. However, interoperability often encounters obstacles arising from quality weaknesses, such as outdated data, difficulties in accessing and interoperating with it, or incomplete metadata.
 
Nevertheless, as has been shown, data sharing is a fundamental mechanism of the data economy, so ensuring the interoperability and reuse of data is a key action for developing the data economy in line with organizations' expectations in terms of innovation.
 
Among the many advantages offered by the reuse of datasets and their interoperability, we can highlight, for example, the creation of applications and services that bring value to society or help in the evaluation of public policies.
 
In addition, the reuse and interoperability of datasets favors economic development in general, and the data economy in particular. It is estimated that this industry will reach a value of 829 billion euros by 2025, according to European Union forecasts. In order to reap the benefits of data sharing, common description standards must first be agreed upon and adhered to: the standards for describing dataset catalog metadata.
Blog

Last December, the Congress of Deputies approved Royal Decree-Law 24/2021, which included the transposition of Directive (EU) 2019/1024 on open data and the reuse of public sector information. This Royal Decree amends Law 37/2007 on the reuse of public sector information, introducing new requirements for public bodies, among them facilitating access to high-value data.

High-value data are data whose reuse is associated with considerable benefits for society, the environment and the economy. Initially, the European Commission highlighted as high-value data those belonging to the categories of geospatial, environmental, meteorological, statistical, companies and company ownership, and mobility data, although these categories can be extended both by the Commission and by the Ministry of Economic Affairs and Digital Transformation through the Data Office. According to the Directive, this type of data "shall be made available for reuse in a machine-readable format, through appropriate application programming interfaces and, where appropriate, in the form of bulk download". In other words, among other things, an API is required.

What is an API?

An application programming interface or API is a set of definitions and protocols that enable the exchange of information between systems. It should be noted that there are different types of APIs based on their architecture, communication protocols and operating systems.

APIs offer a number of advantages for developers, since they automate data and metadata consumption, facilitate mass downloading and optimize information retrieval by supporting filtering, sorting and paging functionalities. All of this results in both economic and time savings. 

In this sense, many open data portals in our country already have their own APIs to facilitate access to data and metadata. In the following infographic you can see some examples at national, regional and local level, including information about the API of datos.gob.es. The infographic also includes brief information on what an API is and what is needed to use it.

APIs for accessing open data and/or its metadata

Click here to see the infographic in full size and in its accessible version
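
To illustrate the kind of automated, paged access described above, here is a minimal sketch that retrieves dataset metadata with the Python requests library. The endpoint and parameter names are modelled on the datos.gob.es API, but they are assumptions for the example and should be verified against the portal's own documentation.

# Sketch: paging through dataset metadata from an open data API with requests.
# The endpoint and parameter names (_pageSize, _page) follow the style of the
# datos.gob.es API, but should be checked against the portal's documentation.
import requests

BASE_URL = "https://datos.gob.es/apidata/catalog/dataset"  # assumed endpoint

def fetch_dataset_titles(pages=2, page_size=10):
    """Download a few pages of dataset metadata and collect their titles."""
    titles = []
    for page in range(pages):
        response = requests.get(
            BASE_URL,
            params={"_pageSize": page_size, "_page": page},
            headers={"Accept": "application/json"},
            timeout=30,
        )
        response.raise_for_status()
        items = response.json().get("result", {}).get("items", [])
        for item in items:
            titles.append(item.get("title"))  # may be a language-tagged value or list
    return titles

if __name__ == "__main__":
    for title in fetch_dataset_titles():
        print(title)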

These examples show the effort that public agencies in our country are making to facilitate access to the information they keep in a more efficient and automated way, in order to promote the reuse of their open data. 

At datos.gob.es we have a Practical Guide for the publication of open data using APIs, which details a series of guidelines and good practices for defining and implementing this mechanism in an open data portal.


Content prepared by the datos.gob.es team.

Blog

Every day, large amounts of data are generated around the world, representing enormous potential for knowledge creation. Much of this data is generated by organisations that make it available to citizens.

It is recommended that the publication of this data in open data portals, such as datos.gob.es, follow the principles that have characterised Open Government Data since its origins: that the data be complete, primary, timely, accessible, machine-readable, non-discriminatory, in open formats and with open licenses.

In order to comply with these principles and guarantee the traceability of the data, it is very important to catalogue it, and to do so it is necessary to know its life cycle.

Data life cycle

When we speak of the "data life cycle" we refer to the different stages that data passes through from its creation to its end. Data is not a static asset during its life cycle; it passes through different phases, as the following image shows.

Source: El ciclo de Vida del Dato, @FUOC, Marcos Pérez. PID_00246836.

Within administrations, new sources of data are continually being created, and it is necessary to maintain a record that makes it possible to document the flows of information through the various systems within the organizations. To do this, we need to establish what is known as data traceability.

Data traceability is the ability to know the entire life cycle of the data: the exact date and time it was extracted, when it was transformed, and when it was loaded from a source environment into a destination environment. This process is known as data lineage.

And to know how the data has behaved during its life cycle, we need a series of metadata.

Let's talk about metadata

The most concise definition of metadata is that it is data about data: it serves to provide information about the data we want to use. Metadata consists of information that characterises data, describing its content and structure, conditions of use, origin and transformations, among other relevant aspects. It is therefore a fundamental element for knowing the quality of the data.

The etymology of the term metadata also hints at its meaning. From the Greek meta, "after" or "beyond", and data, the plural of the Latin datum ("something given"), it literally means "beyond data", alluding to data that describes other data.

According to the DMBOK2 framework of DAMA International, there are three types of metadata:

  • Technical metadata: as the name suggests, they provide information about technical details of the data, the systems that store it and the processes that move it between systems.
  • Operational metadata: describes details of data processing and access.
  • Business metadata: focuses primarily on the content and condition of the data and includes details relating to data governance.

As an example, the metadata needed for cataloguing and describing data is specified in the Technical Standard for the Reuse of Information Resources (NTI, in its Spanish acronym) and includes, among others (a toy record combining these fields is sketched after the list):

  • Title or name of the data set.
  • Description detailing relevant aspects of the data content.
  • Body that publishes the data, for example Madrid City Council.
  • Subject, which we must select from the taxonomy of primary sectors.
  • Format of the dataset.
  • Set of labels that best describe the dataset to facilitate its discovery.
  • Periodicity of updating the information.
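
To make the list above more concrete, here is a small sketch of such a metadata record expressed as a plain Python dictionary, together with a trivial completeness check; the field names and values are invented for the example (the NTI itself defines its fields over DCAT terms).

# Sketch: a minimal metadata record with the kind of fields listed above.
# Field names and values are fictitious and used only for illustration.
record = {
    "title": "Air quality measurements",
    "description": "Hourly air quality readings from municipal stations.",
    "publisher": "Madrid City Council",      # body that publishes the data
    "theme": "environment",                  # taken from the primary-sector taxonomy
    "format": "CSV",
    "keywords": ["air quality", "pollution", "environment"],
    "update_frequency": "hourly",
}

REQUIRED = ["title", "description", "publisher", "theme", "format"]

missing = [field for field in REQUIRED if not record.get(field)]
if missing:
    print("Incomplete record, missing:", ", ".join(missing))
else:
    print("Record ready for cataloguing:", record["title"])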

In addition, if the reference standard for describing metadata allows the inclusion of properties for this purpose, the following information can be added, even though it is not included in the NTI:

  • If the data has undergone transformations, it should be indicated which metric has been used.
  • An indicator of the quality of the data, which can be defined using the vocabulary designed for this purpose, the Data Quality Vocabulary (DQV) (see the sketch after this list).
  • A lineage trace of the dataset, i.e. a kind of family tree of the data that explains where each source comes from.
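
As a rough sketch of how these quality and lineage annotations could be expressed, the snippet below combines Data Quality Vocabulary (DQV) and PROV terms using rdflib; the metric value, agents and URIs are invented for illustration.

# Sketch: attaching a quality measurement (DQV) and a lineage trace (PROV)
# to a dataset description with rdflib. All URIs, values and agents are
# fictitious and shown only as an example of the kind of metadata involved.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, PROV, RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")

g = Graph()
g.bind("dqv", DQV)
g.bind("prov", PROV)
g.bind("dcat", DCAT)

dataset = URIRef("http://example.org/dataset/air-quality")
measurement = URIRef("http://example.org/dataset/air-quality/completeness-check")
raw_source = URIRef("http://example.org/dataset/air-quality-raw")

# Quality indicator: a completeness measurement computed on the dataset
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.computedOn, dataset))
g.add((measurement, DQV.value, Literal("0.98", datatype=XSD.decimal)))
g.add((dataset, DQV.hasQualityMeasurement, measurement))

# Lineage trace: the published dataset was derived from a raw source
# and is attributed to the body that transformed it
g.add((dataset, PROV.wasDerivedFrom, raw_source))
g.add((dataset, PROV.wasAttributedTo, URIRef("http://example.org/organization/data-office")))

print(g.serialize(format="turtle"))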

The benefit of cataloguing

As we have seen, thanks to cataloguing by means of metadata, the user is provided with information about where the data was created, when it was created, who created it, and how it has been transformed as it flows between systems through extraction, transformation and loading operations.

In this way, we provide the user with very valuable information on how the final result was obtained, ensuring full traceability of the data to be reused.

In particular, correct cataloguing helps us to:

  • Increase confidence in the data, providing a context for it and allowing its quality to be measured.
  • Increase the value of strategic data, such as through the master data that characterises transactional data.
  • Avoid the use of outdated data or data that has reached the end of its life cycle.
  • Reduce the time spent by the user in investigating whether the data they need meets their requirements.

The success of an open data portal lies in having well-described and reliable data, as this is a very important information asset for the generation of knowledge. Good data governance must ensure that the data used to make decisions is truly reliable, and for this, proper cataloguing is essential. Cataloguing the data provides answers and offers greater interpretability, so that users can understand which data is best to incorporate into their analyses.


Content prepared by David Puig, Graduate in Information and Documentation and responsible for the Master Data and Reference Group at DAMA ESPAÑA

Contents and points of view expressed in this publication are the exclusive responsibility of its author.

Blog

Research data is very valuable, and permanent access to it is one of the greatest challenges for all the agents involved in the scientific world: researchers, funding agencies, publishers and academic institutions. The long-term preservation of data and the culture of open access are sources of new opportunities for the scientific community. More and more universities and research centres offer repositories with their research data, allowing permanent access to it. Due to the differing requirements of each academic discipline, the existing repositories are very varied.

Researchers face this universe of multiple repositories, tools and formats every day, and consulting the desired data without a guide requires a great deal of time and effort. Re3data.org is an international Registry of Research Data Repositories that collects metadata from repositories specialised in storing research data. Thanks to this compilation work, researchers, funding organisations, libraries and publishers can search and browse the main research data repositories, with faceted views by discipline, subject, country, content, format, licence, language, etc.

The re3data.org registry was born as a joint project of several German organizations, funded by the German Research Foundation (DFG). The official launch took place in May 2013, and the DataBib catalog was subsequently integrated to avoid duplication and confusion caused by the existence of two similar parallel registries. The unification project was sponsored by DataCite, an international non-profit organization whose goal is to improve the quality of data citations. In addition, re3data.org collaborates with other open science projects such as BioSharing or OpenAIRE.

Multiple publishers, research institutions and funding organizations refer to the re3data.org registry in their editorial policies or guidelines as the ideal tool for identifying data repositories. One of the most notable examples is the European Commission (together with Nature and Springer), since it mentions it in the document "Guidelines to the Rules on Open Access to Scientific Publications and Open Access to Research Data in Horizon 2020".

Currently, the metadata stored for each repository are those listed in version 3 of the Metadata Schema for the Description of Research Data Repositories.

The registry identifies and lists nearly 2,000 research data repositories, which makes re3data.org the largest and most complete registry of data repositories available on the web. Its growth has been constant since its launch, covering a wide range of disciplines.

As regards Spain, as of December 1, 2017, 23 research data repositories in which Spain participates are catalogued.

The promotion of open science, the culture of sharing, the reuse of information and open access lies at the foundations of the re3data.org project. On these solid foundations, the tool continues to increase the metadata it collects, and with it the visibility of research data. Continuing to work on increasing this visibility and strengthening open science is not only essential to guarantee research built on previous milestones, but also exponentially expands the horizons of scientific work.

 