The most popular data conversion and data cleaning tools
News date: 08-04-2021

When carrying out a data analysis project, it is common to work with different sources, which in many cases include datasets with heterogeneous formats and structures that do not always share the same quality. Therefore, one of the first phases in any data analysis process is data cleaning.
What is data cleaning?
When we talk about data cleaning, we refer to the set of processes necessary for the preparation and transformation of data from different sources for analysis. As a result of these processes, a homogeneous structure is generated, free of errors and in the appropriate format, which will be actionable in the subsequent stages of analysis. This resulting data set is known as a data mining view.
Cleaning is essential in data processing, as it standardizes and formats the data before they are loaded into the target system, so that we can work with them appropriately.
The data cleaning phase includes, among other processes, data conversion: transforming the data into a specific format so that they can be used by tools that only accept certain formats.
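As a minimal sketch of what such a conversion involves, the following Python snippet turns CSV text into JSON using only the standard library. The function name `csv_to_json` and the sample data are invented for illustration and do not come from any of the tools described in this article.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    # DictReader uses the first row as the keys for every later row.
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), ensure_ascii=False, indent=2)


# Hypothetical sample data, invented for this example.
sample = "product,units\nwidget,3\ngadget,7\n"
print(csv_to_json(sample))
```

Note that every value is kept as text: deciding which columns are numbers or dates is a separate cleaning step.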
On the market we find many data cleaning tools that also convert data to other formats, although there are also tools that perform only one of these tasks. Here are some examples selected on the basis of their popularity, although we invite you to mention in the comments any other tools that may be of interest.
Top examples of data cleaning tools
Two of the most widely used tools in the field of data cleaning are OpenRefine and Talend Open Studio.
OpenRefine
Functionality:
OpenRefine is a free tool that seeks to improve the quality and structure of the data by correcting common errors such as data duplication, incomplete data or inconsistencies. Thanks to it, users can organize, clean, apply transformations, convert to different formats, and enrich data by using web services and other external data sources.
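To make these three kinds of error concrete, here is a small Python sketch that handles them by hand. It does not use OpenRefine or reproduce how OpenRefine works internally. The function `clean_rows`, its `required` parameter and the sample records are all assumptions made for illustration.

```python
import csv
import io


def clean_rows(csv_text, required):
    """Return rows that are normalized, complete and unique."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    cleaned = []
    for row in reader:
        # Inconsistencies: trim whitespace and unify letter case.
        normalized = {
            key: (value or "").strip().lower()
            for key, value in row.items()
            if key is not None  # ignore surplus cells beyond the header
        }
        # Incomplete data: skip rows that lack a required value.
        if any(not normalized.get(column) for column in required):
            continue
        # Duplicates: keep only the first occurrence of each row.
        fingerprint = tuple(normalized.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical records, invented for this example.
raw = "name,email\nAna, ANA@example.org \nana,ana@example.org\nLuis,\n"
print(clean_rows(raw, required=["email"]))
```

In this sketch the two "Ana" rows collapse into one after normalization, and the "Luis" row is dropped because its email is empty. A real project would usually flag such rows for review rather than discard them silently.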
Main advantages:
One of its main advantages is that it uses the GREL language (General Refine Expression Language, originally Google Refine Expression Language), which allows you to perform advanced cleaning tasks by applying a large number of functions, including regular expressions. In addition, it can be extended with add-ons that provide functions to georeference information, link data from DBpedia or other sources, and generate linked data in RDF.
Do you want to know more?
- Support materials: This user manual covers all aspects of configuring and using OpenRefine 3.4.1, including all functions and features of the interface, and this YouTube account offers various video tutorials.
- Repository: This GitHub repository contains the resources needed to run OpenRefine on Mac OS, Linux and Windows.
- User community: OpenRefine users can find discussion groups on Google and Stack Overflow.
- Social media: On the @OpenRefine Twitter account you can find videos, guides, the latest news and information about upcoming events related to OpenRefine.
Talend Open Studio
Functionality:
Talend Open Studio is an open source solution that integrates a set of ETL (Extract, Transform and Load) tools designed to extract, clean and transform data sets for further analysis. As a result, it generates standardized code in Perl and Java that can be reused in different projects.
Main advantages:
This tool stands out for its intuitive interface based on component programming, a technique that consists of concatenating processes with various functionalities through input and output flows.
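The idea of chaining components through input and output flows can be sketched in a few lines of Python. This is only an illustration of the technique, not Talend's API or the code it generates. The component names (`extract`, `drop_missing`, `to_float`), the `run_pipeline` helper and the sample rows are all hypothetical.

```python
def extract(rows):
    """Source component: turn raw tuples into records."""
    for name, amount in rows:
        yield {"name": name, "amount": amount}


def drop_missing(records):
    """Filter component: discard records without an amount."""
    for record in records:
        if record["amount"] is not None:
            yield record


def to_float(records):
    """Transform component: convert the amount from text to a number."""
    for record in records:
        yield {**record, "amount": float(record["amount"])}


def run_pipeline(source, *components):
    """Connect each component's output flow to the next one's input."""
    flow = source
    for component in components:
        flow = component(flow)
    return list(flow)


# Hypothetical input rows, invented for this example.
raw_rows = [("a", "1.5"), ("b", None), ("c", "2")]
result = run_pipeline(extract(raw_rows), drop_missing, to_float)
print(result)
```

Because each component only reads one flow and writes another, components can be reordered or reused in other pipelines without changes. That is the property a component-based interface exposes graphically.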
Do you want to know more?
- Support materials: On Talend's own website you can find different user manuals and tutorials to discover Talend Studio and its interface and to create a project, along with simple sample jobs.
- Repository: This GitHub repository contains the Talend Open Studio source files, which should be used in conjunction with the common code contained in tcommon-studio-se.
- User community: On Stack Overflow there are channels where users share their experience and ask questions.
- Social media: Talend Open Studio has a LinkedIn page and the @Talend profile on Twitter, where news, experiences and use cases, among other content, are shared.
Top examples of data conversion tools
In the case of data conversion, Mr Data Converter, Beautify Converters and Tabula stand out for their popularity.
Mr Data Converter
Functionality:
Mr Data Converter is a web application that allows you to easily convert data found in CSV or Excel to other formats such as CSV, JSON, HTML and XML.
Main advantages:
One of its strengths is that data are entered by copying and pasting them into the application interface, without the need to upload any files. The same applies when exporting: it is enough to copy the generated code and paste it wherever it is needed. On the downside, there is a limit on the maximum size of the data, which should not exceed 300 MB.
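The copy-and-paste workflow can be approximated with a short Python sketch that takes pasted text and returns an HTML table as a string, ready to be copied elsewhere. It is not Mr Data Converter's code. The function `pasted_to_html` is hypothetical, and the tab delimiter is an assumption based on how spreadsheet cells are usually copied to the clipboard.

```python
import csv
import html
import io


def pasted_to_html(pasted, delimiter="\t"):
    """Convert pasted delimited text (header row first) into an HTML table."""
    rows = list(csv.reader(io.StringIO(pasted), delimiter=delimiter))
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    lines = ["<table>"]
    # html.escape keeps characters such as < and & from breaking the markup.
    cells = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    lines.append(f"  <tr>{cells}</tr>")
    for row in body:
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical clipboard content, invented for this example.
clipboard = "item\tnote\npen\tblue & black\n"
print(pasted_to_html(clipboard))
```

Passing `delimiter=","` handles comma-separated text in the same way.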
Do you want to know more?
- Repository: You can find information about the license and different materials in this GitHub repository.
Beautify Converters
Functionality:
Beautify Converters is a web application that allows you to convert data to JSON, SQL, CSV or Excel formats, among others. It belongs to the Beautify Tools collection of free online tools.
Main advantages:
As with Mr Data Converter, the user can enter the data by copying and pasting them into the application interface. It is also possible to upload a file from a local computer. Unlike the previous tool, it supports a significantly higher number of formats, such as SQL, YAML or RSS.
Do you want to know more?
- Repository: This GitHub repository provides information about the license and the rest of the tools in the Beautify Tools collection.
Tabula
Functionality:
Tabula allows you to extract tables from PDF reports (except for those that consist only of images) into formats that can be reused by data analysis and visualization tools.
Main advantages:
Its main advantage is a very simple interface. You only need to upload the PDF, select the tables you want to extract and finally choose the desired format: Excel, CSV or JSON.
Do you want to know more?
- Support materials: The Junta de Andalucía has developed this tutorial, which explains how to upload a PDF file to Tabula and extract the tabular data in CSV format, ready for use in spreadsheets. The process is illustrated with the data set Sanitary quality of bathing water.
- Repository: You can download the materials from this link or GitHub.
- Social media: Although the Twitter account @TabulaPDF is not very up to date, it offers, among other things, information on bug fixes and maintenance, guides and comments from users of the tool.
The following table shows a summary of the tools mentioned above:
Adapting and converting data can consume a large share of a project's resources, in both money and time. These types of tools help to carry out these activities with agility and efficiency, freeing up data scientists to focus on other activities.
For those who want to know more about these tools and others that can help us during the different phases of data processing, at datos.gob.es we offer you the recently updated report "Data processing and visualization tools". You can see the full report here.
You can see more tools related to this field in the following monographs:
- The most popular data analysis tools
- The most popular data visualisation tools
- The most popular data visualisation libraries and APIs
- The most popular geospatial visualisation tools
- The most popular network analysis tools
Content prepared by the datos.gob.es team.