Free tools to work on data quality issues
News date: 11-10-2022

Ensuring data quality is an essential task for any open data initiative. Before publication, datasets need to be validated to check that they are free of errors, duplicates and other defects; this increases their potential for re-use.
Many factors condition data quality. For this reason, the Aporta Initiative has developed the "Practical guide for improving the quality of open data", a compendium of guidelines for acting on the different characteristics that define quality and for promoting its improvement.
The guide includes a list of free tools aimed at applying corrective measures to the data at source. In this article we share some examples. Each tool addresses specific aspects of quality, so its usefulness will depend on the data you are working with and their characteristics.
This is a collection of online tools for format conversion and character encoding tasks. You can choose between different converters; we highlight in particular the tools for working with UTF-8 encoding. The collection compiles a wide catalogue of programming utilities covering conversion, encryption, password generation, text and image editing and management, date and time conversion, mathematical operations and more.
All tools are free, have no intrusive ads and are easy to use thanks to a simple user interface. In addition, each one includes examples of use.
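The core of such encoding converters can be sketched with Python's standard library; the source encoding below (Latin-1) and the sample string are illustrative assumptions:

```python
# Sketch: re-encode text from a legacy encoding to UTF-8, the kind of task
# these online converters automate. Latin-1 as the source is an assumption.
def to_utf8(data: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

legacy = "año de publicación".encode("latin-1")  # bytes in the legacy encoding
print(to_utf8(legacy).decode("utf-8"))  # → año de publicación
```

The same two-step pattern (decode with the real source encoding, encode as UTF-8) applies to any pair of encodings Python knows.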
Managed by the Open Data Institute, this online tool checks whether a CSV file is machine-readable and verifies the columns and types it should include. Users can also add schemas to data files. After the analysis, it generates a report with the results and a badge that can be embedded in the data portal from which the evaluated dataset comes.
Although it is very easy to use (just upload the file to be verified and click the validate button), the website includes a help section. It works well with files up to 100 MB in size. It also offers a short manual with guidelines on how to create a CSV file correctly and avoid the most common mistakes.
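The kind of check such a validator performs can be sketched in a few lines of Python; the expected header and the integer `year` column are illustrative assumptions, not part of any real schema:

```python
import csv
import io

# Sketch: verify that a CSV has the expected columns and that a typed column
# parses. The expected schema below is an assumption for illustration.
EXPECTED = ["id", "name", "year"]

def validate_csv(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passed."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED:
        errors.append(f"unexpected header: {reader.fieldnames}")
        return errors
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["year"].isdigit():
            errors.append(f"line {n}: 'year' is not an integer: {row['year']!r}")
    return errors

sample = "id,name,year\n1,Dataset A,2022\n2,Dataset B,20x2\n"
print(validate_csv(sample))  # reports the bad 'year' value on line 3
```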
DenCode offers online encoding and decoding tools. Among them is one that helps publishers convert date values to the ISO 8601 format, the international standard that facilitates the homogenisation of this type of data and its interoperability.
The tool is very intuitive, as it is only necessary to type the date and time to be converted in the section provided for this purpose.
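The same normalisation can be done with Python's standard library; the input pattern below (day/month/year, assumed to be UTC) is an assumption for illustration:

```python
from datetime import datetime, timezone

# Sketch: normalise a local date string to ISO 8601. The input format
# "%d/%m/%Y %H:%M" and the UTC assumption are illustrative choices.
def to_iso8601(text: str) -> str:
    dt = datetime.strptime(text, "%d/%m/%Y %H:%M").replace(tzinfo=timezone.utc)
    return dt.isoformat()

print(to_iso8601("11/10/2022 09:30"))  # → 2022-10-11T09:30:00+00:00
```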
XML Escape / Unescape is an open-source online tool for "escaping" (masking) special characters in XML and for the reverse process. Escaping replaces characters that could be misinterpreted by a parser, such as & or <, with their corresponding XML entities.
As in the previous case, the tool is very intuitive. It is only necessary to copy and paste the fragment to be processed in the editor.
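Python's standard library performs the same substitution, which makes the operation easy to see:

```python
from xml.sax.saxutils import escape, unescape

# escape() replaces &, < and > with their XML entities; unescape() reverses it.
masked = escape("AT&T <data>")
print(masked)            # → AT&amp;T &lt;data&gt;
print(unescape(masked))  # → AT&T <data>
```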
JSONLint is a validator and reformatter for JSON that checks whether code is valid against that specification. It has an editor where you can write or paste code, and you can also enter a URL directly for validation. JSONLint analyses the code, finds errors, suggests corrections and explains the various reasons why they may occur. The tool can also be used as a compressor, reducing file size.
Its website includes information on good practices when working with the JSON format, as well as information on common errors.
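The essence of such a check is available in Python's standard library, which also reports where the first syntax error occurs:

```python
import json

# Sketch of what a JSON validator reports: json.loads raises an error that
# carries the line and column of the first problem it finds.
def check_json(text: str) -> str:
    try:
        json.loads(text)
        return "valid"
    except json.JSONDecodeError as err:
        return f"invalid: {err.msg} (line {err.lineno}, column {err.colno})"

print(check_json('{"title": "dataset"}'))
print(check_json('{"title": "dataset",}'))  # trailing comma is not valid JSON

# Compression, as JSONLint offers: re-serialise without whitespace.
compact = json.dumps(json.loads('{ "a" : 1, "b" : 2 }'), separators=(",", ":"))
print(compact)  # → {"a":1,"b":2}
```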
OpenRefine is a tool designed for data processing and enrichment: it allows you to clean data, transform their format and extend them with web services and external data. One of its main features is its own expression language, GREL (Google Refine Expression Language), which supports advanced debugging tasks. It is available in more than 15 languages.
Its website offers several videos explaining how it works, and a documentation section with online courses, guides and FAQs. In addition, users can turn to its large community and discussion groups on Google, Gitter and Stack Overflow to resolve doubts and share experiences.
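OpenRefine expresses these clean-up steps in GREL; the kind of transform involved (trimming whitespace, collapsing case variants so duplicates cluster together) can be sketched in Python:

```python
# Sketch of a typical OpenRefine-style clean-up: normalise whitespace and
# case so that variant spellings of the same value collapse into one.
def clean(value: str) -> str:
    return " ".join(value.split()).lower()

raw = ["  Madrid ", "madrid", "MADRID  "]
print(sorted({clean(v) for v in raw}))  # the three variants collapse to one
```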
OpenRefine supports different extensions. One of them transforms tabular data into an RDF schema through a SPARQL endpoint. It works with the following formats: TSV, CSV, SV, XLS, XLSX, JSON, XML, RDF as XML and Google Sheets. The visual interface guides the choice of predicates, the definition of data-type mappings to RDF and the implementation of complex transformations using the GREL language.
The website includes information on how to use the tool, as well as use cases.
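A tabular-to-RDF mapping of the kind the extension performs can be sketched by emitting Turtle triples from CSV rows; the base URI and the predicate name below are assumptions for illustration:

```python
import csv
import io

# Minimal sketch of mapping CSV rows to RDF triples in Turtle syntax.
# The base URI and the 'name' predicate are illustrative assumptions.
BASE = "http://example.org/dataset/"

def rows_to_turtle(text: str) -> str:
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}{row['id']}>"
        lines.append(f'{subject} <{BASE}name> "{row["name"]}" .')
    return "\n".join(lines)

print(rows_to_turtle("id,name\n1,Dataset A\n2,Dataset B\n"))
```

A real mapping would also escape literals and type them; the extension's visual interface handles those details.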
This tool generates and validates JSON Schemas from JSON files. These schemas describe existing data formats, providing clear documentation that is readable by both humans and machines.
On the JSON Schema website, there are several training materials available, including examples, and information on different implementations. You can also learn more about JSON Schema on their GitHub profile.
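What such validation amounts to can be sketched by hand; the tiny "schema" below (required keys plus expected Python types) is an illustrative stand-in, and real use would rely on a JSON Schema validator library:

```python
import json

# Hand-rolled sketch of the checks a JSON Schema validator performs.
# The schema structure here is a simplified assumption, not real JSON Schema.
SCHEMA = {"required": ["title", "year"], "types": {"title": str, "year": int}}

def conforms(document: str, schema: dict = SCHEMA) -> bool:
    data = json.loads(document)
    has_required = all(key in data for key in schema["required"])
    types_ok = all(isinstance(data[k], t)
                   for k, t in schema["types"].items() if k in data)
    return has_required and types_ok

print(conforms('{"title": "Air quality", "year": 2022}'))  # → True
print(conforms('{"title": "Air quality"}'))                # → False
```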
This is an online validation tool for SHACL, the W3C standard for validating RDF graphs against a set of conditions expressed as shapes. As with the previous tools, you only need to copy and paste the code to validate it.
The tool provides some examples of use. In addition, all the code is available on GitHub.
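A minimal shape of the kind such a validator checks might look like this (an illustrative example; the class and property names are assumptions):

```turtle
# Every ex:Dataset must have exactly one ex:title, typed as xsd:string.
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:DatasetShape
    a sh:NodeShape ;
    sh:targetClass ex:Dataset ;
    sh:property [
        sh:path ex:title ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```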
Swagger is a tool for editing and validating specifications that follow the OpenAPI standard. Although it has a paid version with more features, users can create a free account that allows them to design API documentation quickly and in a standardised way. The free version includes intelligent error detection and syntax auto-completion.
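A minimal OpenAPI 3.0 document of the kind the editor validates looks like this (the paths and fields are illustrative assumptions):

```yaml
openapi: "3.0.3"
info:
  title: Open Data Catalogue API
  version: "1.0.0"
paths:
  /datasets:
    get:
      summary: List published datasets
      responses:
        "200":
          description: A JSON array of dataset identifiers
```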
Sphinx is open-source software for generating documentation of any kind. It allows creating hierarchical content structures and automatic indexes, and supports cross-referencing through semantic markup and automatic links for functions, classes, citations, glossary terms and similar pieces of information. It uses the reStructuredText markup language by default and can read MyST Markdown through third-party extensions.
Through its website you can access a large number of tutorials and guides. It also has a large community of users.
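A short reStructuredText fragment shows the semantic markup Sphinx turns into automatic cross-references (the glossary entry is an illustrative example):

```rst
.. glossary::

   open data
      Data that anyone can access, use and share.

For background, see :term:`open data`; Sphinx resolves the reference
to the glossary entry and renders it as a link.
```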
This is open-source software, similar to the one above, that aims to simplify the generation of software documentation by automating its creation, version control and hosting.
It has an extensive tutorial on how to create a documentation project.
This tool converts the words in a text to upper and/or lower case. The user only has to enter a text and the tool converts it into different formats: all uppercase, all lowercase, Title Case (where all important words start with a capital letter, while minor terms such as articles or prepositions stay lowercase) or AP-Style Title Case (where every word starts with a capital letter).
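Three of these modes map directly onto Python string methods; the Title Case rule that keeps articles and prepositions lowercase is approximated below with a small stop-word list (an illustrative assumption):

```python
# Minor words kept lowercase in Title Case; this list is an assumption.
MINOR = {"a", "an", "the", "of", "in", "on", "and", "or", "to"}

def title_case(text: str) -> str:
    """Capitalise every word except minor words (unless they come first)."""
    words = text.lower().split()
    out = [w if i > 0 and w in MINOR else w.capitalize()
           for i, w in enumerate(words)]
    return " ".join(out)

print("open data".upper())  # → OPEN DATA
print("OPEN DATA".lower())  # → open data
print(title_case("a guide to the quality of open data"))
# → A Guide to the Quality of Open Data
```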
These are just a few examples of online tools that can help you work on data quality issues. If you would like to recommend another tool, you can leave us a comment or send a message to dinamizacion@datos.gob.es.
Content prepared by the datos.gob.es team.