Towards data quality: datos.gob.es
News date: 08-06-2017

The quality analysis of the metadata associated with the datasets in the datos.gob.es catalogue, carried out this past April, has four objectives: to provide an objective measure of the quality of the catalogue, to gain more in-depth knowledge of its status, to compare and contrast the opinions received about the catalogue's quality, and to define lines of action to improve the data on offer.
Among the main results of this analysis, it is worth noting that 94% of datasets have at least one machine-processable distribution, 43% of datasets specify how frequently they are updated, and 30% of datasets are available under a Creative Commons License.
As of today, the datos.gob.es portal offers a total of 14,717 datasets, a figure that changes day by day. The report measures the quality of the metadata (applying up to eight variables), how up to date the published data are, and the licenses or terms of use of the data. It also analyses the distributions of the datasets (via six variables) and, alongside the qualitative analysis, includes a series of proposals and lines of action.
Metadata, subject areas, updates and licenses
96.5% of datos.gob.es distributions are accessible and 89% are machine-processable, while 94% of datasets have at least one machine-processable distribution and 77% of datasets are in structured open formats. Only 6% of the datos.gob.es datasets are in unstructured formats.
Other measurements show that 43% of datasets indicate how frequently they are updated and that 100% of datasets show both the date of creation and the date of the last update.
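The distribution-level figure (89%) and the dataset-level figure (94%) differ because they are counted differently: a dataset counts as soon as any one of its distributions qualifies. The following is a minimal sketch in Python of that distinction. The set of formats treated as machine-processable and the example catalogue are assumptions made for illustration, not the report's actual classification or data.

```python
# Hypothetical set of formats treated as machine-processable; the report's
# actual classification is not given in this article.
MACHINE_PROCESSABLE = {"csv", "json", "xml", "rdf"}

def processable_shares(catalogue):
    """Return (distribution share, dataset share) that are machine-processable.

    `catalogue` maps a dataset id to the list of formats of its distributions.
    """
    total_dists = sum(len(formats) for formats in catalogue.values())
    ok_dists = sum(
        1
        for formats in catalogue.values()
        for fmt in formats
        if fmt.lower() in MACHINE_PROCESSABLE
    )
    # A dataset qualifies if at least one of its distributions qualifies.
    ok_datasets = sum(
        1
        for formats in catalogue.values()
        if any(fmt.lower() in MACHINE_PROCESSABLE for fmt in formats)
    )
    return ok_dists / total_dists, ok_datasets / len(catalogue)

# Illustrative data only.
example = {
    "a": ["CSV", "PDF"],
    "b": ["JSON"],
    "c": ["PDF"],
    "d": ["XML", "HTML", "PDF"],
}
dist_share, dataset_share = processable_shares(example)
```

In this toy catalogue three of seven distributions qualify, yet three of four datasets do, which shows how the dataset-level share can exceed the distribution-level share.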
As regards licenses, 64% of the datasets are subject to conditions of use drafted by the publishing agency, 30.32% of the datasets have a Creative Commons License and 4.75% an Open Definition License.
In most cases, datos.gob.es datasets are offered under the general data availability conditions laid down by Royal Decree 1495/2011, of October 24, which implements Law 37/2007, of November 16, on the re-use of public sector information for the state public sector. The citation of the source and the non-distortion of data are two fundamental requirements.
This qualitative analysis examined the ten publishing bodies with the most datasets (the Autonomous Communities of the Basque Country and Aragon and the CSIC lead the ranking, followed by the city councils of Málaga and Gijón, the Diputación Foral de Guipúzcoa, the Xunta de Galicia, the Junta de Castilla y León, the Valencian Government and the City of Madrid) and the ten thematic areas with the most datasets (public sector, society and welfare, economy, demography, environment, education, culture and leisure, public finance, employment and health). It also covered the ten most used labels, the geographic coverage of the datasets (81.5% of use) and the languages of the datasets (96.6% of use), and it details the update frequency.
Distributions and availability of data
Another indicator used to measure the quality of open data is the number of distributions that return each error code. Of the 44,279 distributions belonging to the 13,644 datasets analysed, 3.4% return an error code, and 13 different error codes were found. The most common is 401 (Unauthorized: the resource requires user authentication), followed by 404 (Not Found).
Regarding the categorization of distributions, the report analysed 13,644 datasets, 44,279 distributions and 62 different formats, which gives an average of about three distributions per dataset. By number of formats, 52% of datasets are available in one reusable format, 21% in two reusable formats and 8% in five reusable formats.
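A tally of this kind can be sketched as follows in Python, assuming the HTTP status code of each distribution URL has already been collected (for example, with a HEAD request per URL). The function name, the rule that codes of 400 and above count as errors, and the sample codes are assumptions of this sketch, not details taken from the report.

```python
from collections import Counter

def tally_error_codes(status_codes):
    """Summarise HTTP status codes collected from distribution URLs.

    Codes of 400 and above are counted as errors (an assumption of this sketch).
    Returns (error share, list of (code, count) sorted by descending count).
    """
    codes = list(status_codes)
    errors = Counter(code for code in codes if code >= 400)
    error_share = sum(errors.values()) / len(codes) if codes else 0.0
    return error_share, errors.most_common()

# Illustrative sample: 10 distributions, 3 of them failing.
sample = [200, 200, 401, 200, 404, 200, 401, 200, 200, 200]
share, ranking = tally_error_codes(sample)
```

Applied to the report's figures, a 3.4% error share over 44,279 distributions corresponds to roughly 1,500 failing distributions.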
Proposals and lines of action
The report proposes a series of lines of action, such as promoting the use of standard licenses and publishing a text of reuse terms at a single URL that bodies can adopt as a standard license, since users consulting the data catalogue may currently find up to 168 different URLs for terms of use or licenses.
It is also proposed to continue encouraging the publication of data by the autonomous regions and the State, as well as contacting the bodies responsible to encourage them to update the published information.
To prepare this analysis, the methodology consisted of defining objective indicators that allow all datasets present in datos.gob.es to be analysed automatically according to the following factors:
- Availability of information
- Metadata provided
- Updates
- License
- Formats of distributions
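As an illustration of how indicators like these might be computed automatically, here is a minimal sketch in Python that measures how often each metadata field is filled in. The field names (`frequency`, `license`, `issued`, `modified`) and the sample records are hypothetical stand-ins for the catalogue's metadata; the report's actual tooling is not described in this article.

```python
def metadata_coverage(datasets, fields=("frequency", "license", "issued", "modified")):
    """Return, for each field, the share of datasets with a non-empty value."""
    total = len(datasets)
    coverage = {}
    for field in fields:
        # A missing key and an empty string both count as "not provided".
        filled = sum(1 for d in datasets if d.get(field))
        coverage[field] = filled / total if total else 0.0
    return coverage

# Illustrative records only.
records = [
    {"frequency": "monthly", "license": "CC-BY-4.0", "issued": "2016-01-10", "modified": "2017-03-02"},
    {"frequency": "", "license": "CC-BY-4.0", "issued": "2015-06-01", "modified": "2017-01-15"},
    {"license": "", "issued": "2014-09-30", "modified": "2016-12-20"},
    {"frequency": "yearly", "license": "custom terms", "issued": "2017-02-01", "modified": "2017-04-01"},
]
coverage = metadata_coverage(records)
```

Each value in `coverage` is a share between 0 and 1, comparable in kind to figures such as the 43% of datasets that state their update frequency.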