Open data, a key tool for promoting knowledge and education

Blog

Open solutions, including Open Educational Resources (OER), Open Access to Scientific Information (OA), Free and Open-Source Software (FOSS), and open data, encourage the free flow of information and knowledge, serving as a foundation for addressing global challenges, as reminded by UNESCO.

The United Nations Educational, Scientific and Cultural Organization (UNESCO) recognizes the value of open data in the educational field and believes that its use can contribute to measuring the compliance of the Sustainable Development Goals, especially Goal 4 of Quality Education. Other international organizations also recognize the potential of open data in education. For example, the European Commission has classified the education sector as an area with high potential for open data.

Open data can be used as a tool for education and training in different ways. They can be used to develop new educational materials and to collect and analyze information about the state of the educational system, which can be used to drive improvement.

The global pandemic marked a milestone in the education field, as the use of new technologies became essential in the teaching and learning process, which became entirely virtual for months. Although the benefits of incorporating ICT and open solutions into education, a trend known as Edtech, had been talked about for years, COVID-19 accelerated this process.

Benefits of Using Open Data in the Classroom

In the following infographic, we summarize the benefits of utilizing open data in education and training, from the perspective of both students and educators, as well as administrators of the education system.

There are many datasets that can be used for developing educational solutions. At datos.gob.es, there are more than 6,700 datasets available, which can be supplemented by others used for educational purposes in different fields, such as literature, geography, history, etc.

Many solutions have been developed using open data for these purposes. We gather some of them based on their purpose: firstly, solutions that provide information on the education system to understand its situation and plan new measures, and secondly, those that offer educational material to use in the classroom.

In essence, open data is a key tool for the strengthening and progress of education, and we must not forget that education is a universal right and one of the main tools for the progress of humanity.

Accesible version

12/05/2023

Analysis of meteorological data using the "ggplot2" library

Documentación

1. Introduction

Visualizations are graphical representations of data that allow the information linked to them to be communicated in a simple and effective way. The visualization possibilities are very wide, from basic representations, such as a line chart, bars or sectors, to visualizations configured on dashboards or interactive dashboards.

In this "Step-by-Step Visualizations" section we are regularly presenting practical exercises of open data visualizations available on datos.gob.es or similar catalogs. They address and describe in a simple way the stages necessary to obtain the data, perform the transformations and analysis that are relevant to and finally, the creation of interactive visualizations; from which we can extract information summarized in final conclusions. In each of these practical exercises, simple and well-documented code developments are used, as well as free to use tools. All generated material is available for reuse in GitHub's Data Lab repository.

Run the data pre-processing code on top of Google Colab.

Below, you can access the material that we will use in the exercise and that we will explain and develop in the following sections of this post.

Access the data lab repository on Github.

Run the data pre-processing code on top of Google Colab.

2. Objective

The main objective of this exercise is to make an analysis of the meteorological data collected in several stations during the last years. To perform this analysis, we will use different visualizations generated by the "ggplot2" library of the programming language "R".

Of all the Spanish weather stations, we have decided to analyze two of them, one in the coldest province of the country (Burgos) and another in the warmest province of the country (Córdoba), according to data from the AEMET. Patterns and trends in the different records between 1990 and 2020 will be sought to understand the meteorological evolution suffered in this period of time.

Once the data has been analyzed, we can answer questions such as those shown below:

What is the trend in the evolution of temperatures in recent years?
What is the trend in the evolution of rainfall in recent years?
Which weather station (Burgos or Córdoba) presents a greater variation of climatological data in recent years?
What degree of correlation is there between the different climatological variables recorded?

These, and many other questions can be solved by using tools such as ggplot2 that facilitate the interpretation of data through interactive visualizations.

3. Resources

3.1. Datasets

The datasets contain different meteorological information of interest for the two stations in question broken down by year. Within the AEMET download center, we can download them, upon request of the API key, in the section "monthly / annual climatologies". From the existing weather stations, we have selected two of which we will obtain the data: Burgos airport (2331) and Córdoba airport (5402)

It should be noted that, along with the datasets, we can also download their metadata, which are of special importance when identifying the different variables registered in the datasets.

These datasets are also available in the Github repository.

3.2. Tools

To carry out the data preprocessing tasks, the R programming language written on a Jupyter Notebook hosted in the Google Colab cloud service has been used.

"Google Colab" or, also called Google Colaboratory, is a cloud service from Google Research that allows you to program, execute and share code written in Python or R on a Jupyter Notebook from your browser, so it does not require configuration. This service is free of charge.

For the creation of the visualizations, the ggplot2 library has been used.

"ggplot2" is a data visualization package for the R programming language. It focuses on the construction of graphics from layers of aesthetic, geometric and statistical elements. ggplot2 offers a wide range of high-quality statistical charts, including bar charts, line charts, scatter plots, box and whisker charts, and many others.

If you want to know more about tools that can help you in the treatment and visualization of data, you can use the report "Data processing and visualization tools".

4. Data processing or preparation

The processes that we describe below you will find them commented in the Notebook that you can also run from Google Colab.

Before embarking on building an effective visualization, we must carry out a prior treatment of the data, paying special attention to obtaining them and validating their content, ensuring that they are in the appropriate and consistent format for processing and that they do not contain errors.

As a first step of the process, once the necessary libraries have been imported and the datasets loaded, it is necessary to perform an exploratory analysis of the data (EDA) in order to properly interpret the starting data, detect anomalies, missing data or errors that could affect the quality of the subsequent processes and results. If you want to know more about this process, you can resort to the Practical Guide of Introduction to Exploratory Data Analysis.

The next step is to generate the preprocessed data tables that we will use in the visualizations. To do this, we will filter the initial data sets and calculate the values that are necessary and of interest for the analysis carried out in this exercise.

Once the preprocessing is finished, we will obtain the data tables "datos_graficas_C" and "datos_graficas_B" which we will use in the next section of the Notebook to generate the visualizations.

The structure of the Notebook in which the steps previously described are carried out together with explanatory comments of each of them, is as follows:

Installation and loading of libraries.
Loading datasets
Exploratory Data Analysis (EDA)
Preparing the data tables
Views
Saving graphics

You will be able to reproduce this analysis, as the source code is available in our GitHub account. The way to provide the code is through a document made on a Jupyter Notebook that once loaded into the development environment you can run or modify easily. Due to the informative nature of this post and in order to favor the understanding of non-specialized readers, the code is not intended to be the most efficient but to facilitate its understanding so you will possibly come up with many ways to optimize the proposed code to achieve similar purposes. We encourage you to do so!

5. Visualizations

Various types of visualizations and graphs have been made to extract information on the tables of preprocessed data and answer the initial questions posed in this exercise. As mentioned previously, the R "ggplot2" package has been used to perform the visualizations.

The "ggplot2" package is a data visualization library in the R programming language. It was developed by Hadley Wickham and is part of the "tidyverse" package toolkit. The "ggplot2" package is built around the concept of "graph grammar", which is a theoretical framework for building graphs by combining basic elements of data visualization such as layers, scales, legends, annotations, and themes. This allows you to create complex, custom data visualizations with cleaner, more structured code.

If you want to have a summary view of the possibilities of visualizations with ggplot2, see the following "cheatsheet". You can also get more detailed information in the following "user manual".

5.1. Line charts

Line charts are a graphical representation of data that uses points connected by lines to show the evolution of a variable in a continuous dimension, such as time. The values of the variable are represented on the vertical axis and the continuous dimension on the horizontal axis. Line charts are useful for visualizing trends, comparing evolutions, and detecting patterns.

Next, we can visualize several line graphs with the temporal evolution of the values of average, minimum and maximum temperatures of the two meteorological stations analyzed (Córdoba and Burgos). On these graphs, we have introduced trend lines to be able to observe their evolution in a visual and simple way.

To compare the evolutions, not only visually through the graphed trend lines, but also numerically, we obtain the slope coefficients of the trend line, that is, the change in the response variable (tm_ month, tm_min, tm_max) for each unit of change in the predictor variable (year).

Average temperature slope coefficient Córdoba: 0.036
Average temperature slope coefficient Burgos: 0.025
Coefficient of slope minimum temperature Córdoba: 0.020
Coefficient of slope minimum temperature Burgos: 0.020
Slope coefficient maximum temperature Córdoba: 0.051
Slope coefficient maximum temperature Burgos: 0.030

We can interpret that the higher this value, the more abrupt the average temperature rise in each observed period.

Finally, we have created a line graph for each weather station, in which we jointly visualize the evolution of average, minimum and maximum temperatures over the years.

The main conclusions obtained from the visualizations of this section are:

The average, minimum and maximum annual temperatures recorded in Córdoba and Burgos have an increasing trend.
The most significant increase is observed in the evolution of the maximum temperatures of Córdoba (slope coefficient = 0.051)
The slightest increase is observed in the evolution of the minimum temperatures, both in Córdoba and Burgos (slope coefficient = 0.020)

5.2. Bar charts

Bar charts are a graphical representation of data that uses rectangular bars to show the magnitude of a variable in different categories or groups. The height or length of the bars represents the amount or frequency of the variable, and the categories are represented on the horizontal axis. Bar charts are useful for comparing the magnitude of different categories and for visualizing differences between them.

We have generated two bar graphs with the data corresponding to the total accumulated precipitation per year for the different weather stations.

As in the previous section, we plot the trend line and calculate the slope coefficient.

Slope coefficient for accumulated rainfall Córdoba: -2.97
Slope coefficient for accumulated rainfall Burgos: -0.36

The main conclusions obtained from the visualizations of this section are:

The annual accumulated rainfall has a decreasing trend for both Córdoba and Burgos.
The downward trend is greater for Córdoba (coefficient = -2.97), being more moderate for Burgos (coefficient = -0.36)

5.3. Histograms

Histograms are a graphical representation of a frequency distribution of numeric data in a range of values. The horizontal axis represents the values of the data divided into intervals, called "bin", and the vertical axis represents the frequency or amount of data found in each "bin". Histograms are useful for identifying patterns in data, such as distribution, dispersion, symmetry, or bias.

We have generated two histograms with the distributions of the data corresponding to the total accumulated precipitation per year for the different meteorological stations, being the chosen intervals of 50 mm3.

The main conclusions obtained from the visualizations of this section are:

The records of annual accumulated precipitation in Burgos present a distribution close to a normal and symmetrical distribution.
The records of annual accumulated precipitation in Córdoba do not present a symmetrical distribution.

5.4. Box and whisker diagrams

Box and whisker diagrams are a graphical representation of the distribution of a set of numerical data. These graphs represent the median, interquartile range, and minimum and maximum values of the data. The chart box represents the interquartile range, that is, the range between the first and third quartiles of the data. Out-of-the-box points, called outliers, can indicate extreme values or anomalous data. Box plots are useful for comparing distributions and detecting extreme values in your data.

We have generated a graph with the box diagrams corresponding to the accumulated rainfall data from the weather stations.

To understand the graph, the following points should be highlighted:

The boundaries of the box indicate the first and third quartiles (Q1 and Q3), which leave below each, 25% and 75% of the data respectively.
The horizontal line inside the box is the median (equivalent to the second quartile Q2), which leaves half of the data below.
The whisker limits are the extreme values, that is, the minimum value and the maximum value of the data series.
The points outside the whiskers are the outliers.

The main conclusions obtained from the visualization of this section are:

Both distributions present 3 extreme values, being significant those of Córdoba with values greater than 1000 mm3.
The records of Córdoba have a greater variability than those of Burgos, which are more stable

5.5. Pie charts

A pie chart is a type of pie chart that represents proportions or percentages of a whole. It consists of several sections or sectors, where each sector represents a proportion of the whole set. The size of the sector is determined based on the proportion it represents, and is expressed in the form of an angle or percentage. It is a useful tool for visualizing the relative distribution of the different parts of a set and facilitates the visual comparison of the proportions between the different groups.

We have generated two graphs of (polar) sectors. The first of them with the number of days that the values exceed 30º in Córdoba and the second of them with the number of days that the values fall below 0º in Burgos.

For the realization of these graphs, we have grouped the sum of the number of days described above into six groups, corresponding to periods of 5 years from 1990 to 2020.

The main conclusions obtained from the visualizations of this section are:

There is an increase of 31.9% in the total number of annual days with temperatures above 30º in Córdoba for the period between 2015-2020 compared to the period 1990-1995.
There is an increase of 33.5% in the total number of annual days with temperatures above 30º in Burgos for the period between 2015-2020 compared to the period 1990-1995.

5.6. Scatter plots

Scatter plots are a data visualization tool that represent the relationship between two numerical variables by locating points on a Cartesian plane. Each dot represents a pair of values of the two variables and its position on the graph indicates how they relate to each other. Scatter plots are commonly used to identify patterns and trends in data, as well as to detect any possible correlation between variables. These charts can also help identify outliers or data that doesn't fit the overall trend.

We have generated two scattering plots in which the values of maximum average temperatures and minimum averages are compared, looking for correlation trends between them for the values of each weather station.

To analyze the correlations, not only visually through graphs, but also numerically, we obtain Pearson's correlation coefficients. This coefficient is a statistical measure that indicates the degree of linear association between two quantitative variables. It is used to assess whether there is a positive linear relationship (both variables increase or decrease simultaneously at a constant rate), negative (the values of both variables vary oppositely) or null (no relationship) between two variables and the strength of such a relationship, the closer to +1, the higher their association.

Pearson coefficient (Average temperature max VS min) Córdoba: 0.15
Pearson coefficient (Average temperature max VS min) Burgos: 0.61

In the image we observe that while in Córdoba a greater dispersion is appreciated, in Burgos a greater correlation is observed.

Next, we will modify the previous scatter plots so that they provide us with more information visually. To do this, we divide the space by colored sectors (red with higher temperature values / blue lower temperature values) and show in the different bubbles the label with the corresponding year. It should be noted that the color change limits of the quadrants correspond to the average values of each of the variables.

The main conclusions obtained from the visualizations of this section are:

There is a positive linear relationship between the average maximum and minimum temperature in both Córdoba and Burgos, this correlation being greater in the Burgos data.
The years with the highest values of maximum and minimum temperatures in Burgos are (2003, 2006 and 2020)
The years with the highest values of maximum and minimum temperatures in Córdoba are (1995, 2006 and 2020)

5.7. Correlation matrix

The correlation matrix is a table that shows the correlations between all variables in a dataset. It is a square matrix that shows the correlation between each pair of variables on a scale ranging from -1 to 1. A value of -1 indicates a perfect negative correlation, a value of 0 indicates no correlation, and a value of 1 indicates a perfect positive correlation. The correlation matrix is commonly used to identify patterns and relationships between variables in a dataset, which can help to better understand the factors that influence a phenomenon or outcome.

We have generated two heat maps with the correlation matrix data for both weather stations.

The main conclusions obtained from the visualizations of this section are:

There is a strong negative correlation (-0.42) for Córdoba and (-0.45) for Burgos between the number of annual days with temperatures above 30º and accumulated rainfall. This means that as the number of days with temperatures above 30º increases, precipitation decreases significantly.

6. Conclusions of the exercise

Data visualization is one of the most powerful mechanisms for exploiting and analyzing the implicit meaning of data. As we have seen in this exercise, "ggplot2" is a powerful library capable of representing a wide variety of graphics with a high degree of customization that allows you to adjust numerous characteristics of each graph.

After analyzing the previous visualizations, we can conclude that both for the weather station of Burgos, as well as that of Córdoba, temperatures (minimum, average, maximum) have suffered a considerable increase, days with extreme heat (temperature > 30º) have also suffered and rainfall has decreased in the period of time analyzed, from 1990 to 2020.

We hope that this step-by-step visualization has been useful for learning some very common techniques in the treatment, representation and interpretation of open data. We will be back to show you new reuses. See you soon!

12/04/2023

Generating personalized tourist map with "Google My Maps"

Documentación

1. Introduction

Visualizations are graphical representations of the data allowing to transmit in a simple and effective way related information. The visualization capabilities are extensive, from basic representations, such as a line chart, bars or sectors, to visualizations configured on control panels or interactive dashboards.

In this "Step-by-Step Visualizations" section we are periodically presenting practical exercises of open data visualizations available in datos.gob.es or other similar catalogs. They address and describe in an easy manner stages necessary to obtain the data, to perform transformations and analysis relevant to finally creating interactive visualizations, from which we can extract information summarized in final conclusions. In each of these practical exercises simple and well-documented code developments are used, as well as open-source tools. All generated materials are available for reuse in the GitHub repository.

In this practical exercise, we made a simple code development that is conveniently documented relying on free to use tools.

Access the data lab repository on Github

Run the data pre-procesing code on top of Google Colab

2. Objective

The main scope of this post is to show how to generate a custom Google Maps map using the "My Maps" tool based on open data. These types of maps are highly popular on websites, blogs and applications in the tourism sector, however, the useful information provided to the user is usually scarce.

In this exercise, we will use potential of the open-source data to expand the information to be displayed on our map in an automatic way. We will also show how to enrich open data with context information that significantly improves the user experience.

From a functional point of view, the goal of the exercise is to create a personalized map for planning tourist routes through the natural areas of the autonomous community of Castile and León. For this, open data sets published by the Junta of Castile and León have been used, which we have pre-processed and adapted to our needs in order to generate a personalized map.

3. Resources

3.1. Datasets

The datasets contain different tourist information of geolocated interest. Within the open data catalog of the Junta of Castile and León, we may find the "dictionary of entities" (additional information section), a document of vital importance, since it defines the terminology used in the different data sets.

These datasets are also available in the Github repository.

3.2. Tools

To carry out the data preprocessing tasks, the Python programming language written on a Jupyter Notebook hosted in the Google Colab cloud service has been used.

"Google Colab" also called " Google Colaboratory", is a free cloud service from Google Research that allows you to program, execute and share from your browser code written in Python or R, so it does not require installation of any tool or configuration.

For the creation of the interactive visualization, the Google My Maps tool has been used.

"Google My Maps" is an online tool that allows you to create interactive maps that can be embedded in websites or exported as files. This tool is free, easy to use and allows multiple customization options.

If you want to know more about tools that can help you with the treatment and visualization of data, you can go to the section "Data processing and visualization tools".

4. Data processing and preparation

The processes that we describe below are commented in the Notebook which you can run from Google Colab.

Before embarking on building an effective visualization, we must carry out a prior data treatment, paying special attention to obtaining them and validating their content, ensuring that they are in the appropriate and consistent format for processing and that they do not contain errors.

The first step necessary is performing the exploratory analysis of the data (EDA) in order to properly interpret the starting data, detect anomalies, missing data or errors that could affect the quality of the subsequent processes and results. If you want to know more about this process, you can go to the Practical Guide of Introduction to Exploratory Data Analysis.

The next step is to generate the tables of preprocessed data that will be used to feed the map. To do so, we will transform the coordinate systems, modify and filter the information according to our needs.

The steps required in this data preprocessing, explained in the Notebook, are as follows:

Installation and loading of libraries
Loading datasets
Exploratory Data Analysis (EDA)
Preprocessing of datasets

During the preprocessing of the data tables, it is necessary to change the coordinate system since in the source datasets the ESTR89 (standard system used in the European Union) is used, while we will need them in the WGS84 (system used by Google My Maps among other geographical applications). How to make this coordinate change is explained in the Notebook. If you want to know more about coordinate types and systems, you can use the "Spatial Data Guide".

Once the preprocessing is finished, we will obtain the data tables "recreational_natural_parks.csv", "rural_accommodations_2stars.csv", "natural_park_shelters.csv", "observatories_natural_parks.csv", "viewpoints_natural_parks.csv", "park_houses.csv", "trees_natural_parks.csv" which include generic and common information fields such as: name, observations, geolocation,... together with specific information fields, which are defined in details in section "6.2 Personalization of the information to be displayed on the map".

You will be able to reproduce this analysis, as the source code is available in our GitHub account. The code can be provided through a document made on a Jupyter Notebook once loaded into the development environment can be easily run or modified. Due to informative nature of this post and to favor understanding of non-specialized readers, the code is not intended to be the most efficient, but rather to facilitate its understanding so you could possibly come up with many ways to optimize the proposed code to achieve similar purposes. We encourage you to do so!

5. Data enrichment

To provide more related information, a data enrichment process is carried out on the dataset "hotel accommodation registration" explained below. With this step we will be able to automatically add complementary information that was initially not included. With this, we will be able to improve the user experience during their use of the map by providing context information related to each point of interest.

For this we will apply a useful tool for such kind of a tasks: OpenRefine. This open-source tool allows multiple data preprocessing actions, although this time we will use it to carry out an enrichment of our data by incorporating context by automatically linking information that resides in the popular Wikidata knowledge repository.

Once the tool is installed on our computer, when executed – a web application will open in the browser in case it is not opened automatically.

Here are the steps to follow.

Step 1

Loading the CSV into the system (Figure 1). In this case, the dataset "Hotel accommodation registration".

Figure 1. Uploading CSV file to OpenRefine

Step 2

Creation of the project from the uploaded CSV (Figure 2). OpenRefine is managed by projects (each uploaded CSV will be a project), which are saved on the computer where OpenRefine is running for possible later use. In this step we must assign a name to the project and some other data, such as the column separator, although the most common is that these last settings are filled automatically.

Figure 2. Creating a project in OpenRefine

Step 3

Linked (or reconciliation, using OpenRefine nomenclature) with external sources. OpenRefine allows us to link resources that we have in our CSV with external sources such as Wikidata. To do this, the following actions must be carried out:

Identification of the columns to be linked. Usually, this step is based on the analyst experience and knowledge of the data that is represented in Wikidata. As a hint, generically you can reconcile or link columns that contain more global or general information such as country, streets, districts names etc., and you cannot link columns like geographical coordinates, numerical values or closed taxonomies (types of streets, for example). In this example, we have the column "municipalities" that contains the names of the Spanish municipalities.
Beginning of reconciliation (Figure 3). We start the reconciliation and select the default source that will be available: Wikidata. After clicking Start Reconciling, it will automatically start searching for the most suitable Wikidata vocabulary class based on the values in our column.
Obtaining the values of reconciliation. OpenRefine offers us an option of improving the reconciliation process by adding some features that allow us to conduct the enrichment of information with greater precision.

Figure 3. Selecting the class that best represents the values in the "municipality"

Step 4

Generate a new column with the reconciled or linked values (Figure 4). To do this we need to click on the column "municipality" and go to "Edit Column → Add column based in this column", where a text will be displayed in which we will need to indicate the name of the new column (in this example it could be "wikidata"). In the expression box we must indicate: "http://www.wikidata.org/ entity/"+cell.recon.match.id and the values appear as previewed in the Figure. "http://www.wikidata.org/entity/" is a fixed text string to represent Wikidata entities, while the reconciled value of each of the values is obtained through the cell.recon.match.id statement, that is, cell.recon.match.id("Adanero") = Q1404668

Thanks to the abovementioned operation, a new column will be generated with those values. In order to verify that it has been executed correctly, we click on one of the cells in the new column which should redirect to the Wikidata webpage with reconciled value information.

Figure 4. Generating a new column with reconciled values

Step 5

We repeat the process by changing in step 4 the "Edit Column → Add column based in this column" with "Add columns from reconciled values" (Figure 5). In this way, we can choose the property of the reconciled column.

In this exercise we have chosen the "image" property with identifier P18 and the "population" property with identifier P1082. Nevertheless, we could add all the properties that we consider useful, such as the number of inhabitants, the list of monuments of interest, etc. It should be mentioned that just as we enrich data with Wikidata, we can do so with other reconciliation services.

Figura 5. Choice of property for reconciliation

In the case of the "image" property, due to the display, we want the value of the cells to be in the form of a link, so we have made several adjustments. These adjustments have been the generation of several columns according to the reconciled values, adequacy of the columns through commands in GREL language (OpenRefine''s own language) and union of the different values of both columns. You can check these settings and more techniques to improve your handling of OpenRefine and adapt it to your needs in the following User Manual.

6. Map visualization

6.1 Map generation with "Google My Maps"

To generate the custom map using the My Maps tool, we have to execute the following steps:

We log in with a Google account and go to "Google My Maps", with free access with no need to download any kind of software.
We import the preprocessed data tables, one for each new layer we add to the map. Google My Maps allows you to import CSV, XLSX, KML and GPX files (Figure 6), which should include associated geographic information. To perform this step, you must first create a new layer from the side options menu.

Figure 6. Importing files into "Google My Maps"

In this case study, we''ll import preprocessed data tables that contain one variable with latitude and other with longitude. This geographic information will be automatically recognized. My Maps also recognizes addresses, postal codes, countries, ...

Figura 7. Select columns with placement values

With the edit style option in the left side menu, in each of the layers, we can customize the pins, editing their color and shape.

Figure 8. Position pin editing

Finally, we can choose the basemap that we want to display at the bottom of the options sidebar.

Figura 9. Basemap selection

If you want to know more about the steps for generating maps with "Google My Maps", check out the following step-by-step tutorial.

6.2 Personalization of the information to be displayed on the map

During the preprocessing of the data tables, we have filtered the information according to the focus of the exercise, which is the generation of a map to make tourist routes through the natural spaces of Castile and León. The following describes the customization of the information that we have carried out for each of the datasets.

In the dataset belonging to the singular trees of the natural areas, the information to be displayed for each record is the name, observations, signage and position (latitude / longitude)
In the set of data belonging to the houses of the natural areas park, the information to be displayed for each record is the name, observations, signage, access, web and position (latitude / longitude)
In the set of data belonging to the viewpoints of the natural areas, the information to be displayed for each record is the name, observations, signage, access and position (latitude / longitude)
In the dataset belonging to the observatories of natural areas, the information to be displayed for each record is the name, observations, signaling and position (latitude / longitude)
In the dataset belonging to the shelters of the natural areas, the information to be displayed for each record is the name, observations, signage, access and position (latitude / longitude). Since shelters can be in very different states and that some records do not offer information in the "observations" field, we have decided to filter to display only those that have information in that field.
In the set of data belonging to the recreational areas of the natural park, the information to be displayed for each record is the name, observations, signage, access and position (latitude / longitude). We have decided to filter only those that have information in the "observations" and "access" fields.
In the set of data belonging to the accommodations, the information to be displayed for each record is the name, type of establishment, category, municipality, web, telephone and position (latitude / longitude). We have filtered the "type" of establishment only those that are categorized as rural tourism accommodations and those that have 2 stars.

Following a visualization of the custom map we have created is returned. By selecting the icon to enlarge the map that appears in the upper right corner, you can access its full-screen display

6.3 Map functionalities (layers, pins, routes and immersive 3D view)

At this point, once the custom map is created, we will explain various functionalities offered by "Google My Maps" during the visualization of the data.

Layers

Using the drop-down menu on the left, we can activate and deactivate the layers to be displayed according to our needs.

Figure 10. Layers in "My Maps"

Pins

By clicking on each of the pins of the map we can access the information associated with that geographical position.

Figure 11. Pins in "My Maps"

Routes

We can create a copy of the map on which to add our personalized tours.

In the options of the left side menu select "copy map". Once the map is copied, using the add directions symbol, located below the search bar, we will generate a new layer. To this layer we can indicate two or more points, next to the means of transport and it will create the route next to the route indications.

Figure 12. Routes in "My Maps"

3D immersive map

Through the options symbol that appears in the side menu, we can access Google Earth, from where we can explore the immersive map in 3D, highlighting the ability to observe the altitude of the different points of interest. You can also access through the following link.

Figure 13. 3D immersive view

7. Conclusions of the exercise

Data visualization is one of the most powerful mechanisms for exploiting and analyzing the implicit meaning of data. It is worth highlighting the vital importance that geographical data have in the tourism sector, which we have been able to verify in this exercise.

As a result, we have developed an interactive map with information provided by Linked Data, which we have customized according to our interests.

We hope that this step-by-step visualization has been useful for learning some very common techniques in the treatment and representation of open data. We will be back to show you new reuses. See you soon!

09/03/2023

Analysis of toxicological findings in road traffic accidents

Documentación

1. Introduction

Visualizations are graphical representations of data that allows comunication in a simple and effective way the information linked to it. The visualization possibilities are very wide, from basic representations, such as a graph of lines, bars or sectors, to visualizations configured on dashboards or interactive dashboards. Visualizations play a fundamental role in drawing conclusions using visual language, also allowing to detect patterns, trends, anomalous data or project predictions, among many other functions.

In this section of "Step-by-Step Visualizations" we are periodically presenting practical exercises of open data visualizations available in datos.gob.es or other similar catalogs. They address and describe in a simple way the necessary stages to obtain the data, perform the transformations and analysis that are relevant to it and finally, the creation of interactive visualizations. From these visualizations we can extract information to summarize in the final conclusions. In each of these practical exercises, simple and well-documented code developments are used, as well as free to use tools. All generated material is available for reuse in the Github data lab repository belonging to datos.gob.es.

In this practical exercise, we have carried out a simple code development that is conveniently documented based on free to use tool.

Access the data lab repository on Github.

Run the data pre-processing code on Google Colab.

2. Objetive

The main objective of this post is to show how to make an interactive visualization based on open data. For this practical exercise we have used a dataset provided by the Ministry of Justice that contains information about the toxicological results made after traffic accidents that we will cross with the data published by the Central Traffic Headquarters (DGT) that contain the detail on the fleet of vehicles registered in Spain.

From this data crossing we will analyze and be able to observe the ratios of positive toxicological results in relation to the fleet of registered vehicles.

It should be noted that the Ministry of Justice makes available to citizens various dashboards to view data on toxicological results in traffic accidents. The difference is that this practical exercise emphasizes the didactic part, we will show how to process the data and how to design and build the visualizations.

3. Resources

3.1. Datasets

For this case study, a dataset provided by the Ministry of Justice has been used, which contains information on the toxicological results carried out in traffic accidents. This dataset is in the following Github repository:

Dataset toxicological results in accidents

The datasets of the fleet of vehicles registered in Spain have also been used. These data sets are published by the Central Traffic Headquarters (DGT), an agency under the Ministry of the Interior. They are available on the following page of the datos.gob.es Data Catalog:

Registered vehicle fleet datasets

3.2. Tools

To carry out the data preprocessing tasks it has been used the Python programming language written on a Jupyter Notebook hosted in the Google Colab cloud service.

Google Colab (also called Google Colaboratory), is a free cloud service from Google Research that allows you to program, execute and share code written in Python or R from your browser, so it does not require the installation of any tool or configuration.

For the creation of the interactive visualization, the Google Data Studio tool has been used.

Google Data Studio is an online tool that allows you to make graphs, maps or tables that can be embedded in websites or exported as files. This tool is simple to use and allows multiple customization options.

If you want to know more about tools that can help you in the treatment and visualization of data, you can use the report "Data processing and visualization tools".

4. Data processing or preparation

Before launching to build an effective visualization, we must carry out a previous treatment of the data, paying special attention to obtaining it and validating its content, ensuring that it is in the appropriate and consistent format for processing and that it does not contain errors.

The processes that we describe below will be discussed in the Notebook that you can also run from Google Colab. Link to Google Colab notebook

As a first step of the process, it is necessary to perform an exploratory data analysis (EDA) in order to properly interpret the starting data, detect anomalies, missing data or errors that could affect the quality of subsequent processes and results. Pre-processing of data is essential to ensure that analyses or visualizations subsequently created from it are reliable and consistent. If you want to know more about this process, you can use the Practical Guide to Introduction to Exploratory Data Analysis.

The next step to take is the generation of the preprocessed data tables that we will use to generate the visualizations. To do this we will adjust the variables, cross data between both sets and filter or group as appropriate.

The steps followed in this data preprocessing are as follows:

Importing libraries
Loading data files to use
Detection and processing of missing data (NAs)
Modifying and adjusting variables
Generating tables with preprocessed data for visualizations
Storage of tables with preprocessed data

You will be able to reproduce this analysis since the source code is available in our GitHub account. The way to provide the code is through a document made on a Jupyter Notebook that once loaded into the development environment you can execute or modify easily. Due to the informative nature of this post and favor the understanding of non-specialized readers, the code does not intend to be the most efficient, but to facilitate its understanding, so you will possibly come up with many ways to optimize the proposed code to achieve similar purposes. We encourage you to do so!

5. Generating visualizations

Once we have done the preprocessing of the data, we go with the visualizations. For the realization of these interactive visualizations, the Google Data Studio tool has been used. Being an online tool, it is not necessary to have software installed to interact or generate any visualization, but it is necessary that the data tables that we provide are properly structured, for this we have made the previous steps for the preparation of the data.

The starting point is the approach of a series of questions that visualization will help us solve. We propose the following:

How is the fleet of vehicles in Spain distributed by Autonomous Communities?
What type of vehicle is involved to a greater and lesser extent in traffic accidents with positive toxicological results?
Where are there more toxicological findings in traffic fatalities?

Let''s look for the answers by looking at the data!

5.1. Fleet of vehicles registered by Autonomous Communities

This visual representation has been made considering the number of vehicles registered in the different Autonomous Communities, breaking down the total by type of vehicle. The data, corresponding to the average of the month-to-month records of the years 2020 and 2021, are stored in the "parque_vehiculos.csv" table generated in the preprocessing of the starting data.

Through a choropleth map we can visualize which CCAAs are those that have a greater fleet of vehicles. The map is complemented by a ring graph that provides information on the percentages of the total for each Autonomous Community.

As defined in the "Data visualization guide of the Generalitat Catalana" the choropletic (or choropleth) maps show the values of a variable on a map by painting the areas of each affected region of a certain color. They are used when you want to find geographical patterns in the data that are categorized by zones or regions.

Ring charts, encompassed in pie charts, use a pie representation that shows how the data is distributed proportionally.

Once the visualization is obtained, through the drop-down tab, the option to filter by type of vehicle appears.

View full screen visualization

5.2. Ratio of positive toxicological results for different types of vehicles

This visual representation has been made considering the ratios of positive toxicological results by number of vehicles nationwide. We count as a positive result each time a subject tests positive in the analysis of each of the substances, that is, the same subject can count several times in the event that their results are positive for several substances. For this purpose, the table "resultados_vehiculos.csv” has been generated during data preprocessing.

Using a stacked bar chart, we can evaluate the ratios of positive toxicological results by number of vehicles for different substances and different types of vehicles.

As defined in the "Data visualization guide of the Generalitat Catalana" bar graphs are used when you want to compare the total value of the sum of the segments that make up each of the bars. At the same time, they offer insight into how large these segments are.

When stacked bars add up to 100%, meaning that each segmented bar occupies the height of the representation, the graph can be considered a graph that allows you to represent parts of a total.

The table provides the same information in a complementary way.

Once the visualization is obtained, through the drop-down tab, the option to filter by type of substance appears.

View full screen visualization

5.3. Ratio of positive toxicological results for the Autonomous Communities

This visual representation has been made taking into account the ratios of the positive toxicological results by the fleet of vehicles of each Autonomous Community. We count as a positive result each time a subject tests positive in the analysis of each of the substances, that is, the same subject can count several times in the event that their results are positive for several substances. For this purpose, the "resultados_ccaa.csv" table has been generated during data preprocessing.

It should be noted that the Autonomous Community of registration of the vehicle does not have to coincide with the Autonomous Community where the accident has been registered, however, since this is a didactic exercise and it is assumed that in most cases they coincide, it has been decided to start from the basis that both coincide.

Through a choropleth map we can visualize which CCAAs are the ones with the highest ratios. To the information provided in the first visualization on this type of graph, we must add the following.

As defined in the "Data Visualization Guide for Local Entities" one of the requirements for choropleth maps is to use a numerical measure or datum, a categorical datum for the territory, and a polygon geographic datum.

The table and bar chart provides the same information in a complementary way.

Once the visualization is obtained, through the peeling tab, the option to filter by type of substance appears.

View full screen visualization

6. Conclusions of the study

Data visualization is one of the most powerful mechanisms for exploiting and analyzing the implicit meaning of data, regardless of the type of data and the degree of technological knowledge of the user. Visualizations allow us to build meaning on top of data and create narratives based on graphical representation. In the set of graphical representations of data that we have just implemented, the following can be observed:

The fleet of vehicles of the Autonomous Communities of Andalusia, Catalonia and Madrid corresponds to about 50% of the country''s total.
The highest positive toxicological results ratios occur in motorcycles, being of the order of three times higher than the next ratio, passenger cars, for most substances.
The lowest positive toxicology result ratios occur in trucks.
Two-wheeled vehicles (motorcycles and mopeds) have higher "cannabis" ratios than those obtained in "cocaine", while four-wheeled vehicles (cars, vans and trucks) have higher "cocaine" ratios than those obtained in "cannabis"
The Autonomous Community where the ratio for the total of substances is highest is La Rioja.

It should be noted that in the visualizations you have the option to filter by type of vehicle and type of substance. We encourage you to do so to draw more specific conclusions about the specific information you''re most interested in.

We hope that this step-by-step visualization has been useful for learning some very common techniques in the treatment and representation of open data. We will return to show you new reuses. See you soon!

18/01/2023

News about the open data ecosystem (autumn 2022)

Noticia

The last few months of the year are always accompanied by numerous innovations in the open data ecosystem. It is the time chosen by many organisations to stage conferences and events to show the latest trends in the field and to demonstrate their progress.

New functionalities and partnerships

Public bodies have continued to make progress in their open data strategies, incorporating new functionalities and data sets at their open data platforms. Examples include:

On 11 November, the Ministry for the Ecological Transition and the Demographic Challenge and The Information Lab Spain presented the SIDAMUN platform (Integrated Municipal Data System). It is a data visualisation tool with interactive dashboards which show detailed information about the current status of the territory.
The Ministry of Agriculture, Food and Fisheries has published four interactive reports to exploit more than 500 million data elements and thus provide information in a simple way about the status and evolution of the Spanish primary sector.
The Open Data Portal of the Regional Government of Andalusia has been updated in order to promote the reuse of information, expanding the possibilities of access through APIs in a more efficient, automated way.
The National Geographic Institute has updated the information on green routes (reconditioned railway lines) which are already available for download in KML, GPX and SHP.
The Institute for Statistics and Cartography of Andalusia has published data on the Natural Movement of the Population for 2021, which provides information on births, marriages and deaths.

We have also seen advances made from a strategic perspective and in terms of partnerships. The Regional Ministry of Participation and Transparency of the Valencian Regional Government set in motion a participatory process to design the first action plan of the 'OGP Local' programme of the Open Government Partnership. In turn, the Government of the Canary Islands has applied for admission to the International Open Government Partnership and it will strengthen collaboration with the local entities of the islands, thereby mainstreaming the Open Government policies.

In addition, various organisations have announced news for the coming months. This is the case of Cordoba City Council which is set to launch in the near future a new portal with open data, or of Torrejon City Council which has included in its local action plan the creation of an Open data portal, as well as the promotion of the use of big data in institutions.

Open data tenders, a showcase for finding talent and new use cases

During the autumn, the winners of various contests were announced which sought to promote the reuse of open data. Thanks to these tenders, we have also learned of numerous cases of reuse which demonstrate open data's capacity to generate social and economic benefits.

At the end of October we met the winners of our “Aporta” Challenge. First prize went to HelpVoice!, a service that seeks to help the elderly using speech recognition techniques based on automatic learning. A web environment to facilitate the analysis and interactive visualisation of microdata from the Hospital Morbidity Survey and an app to promote healthy habits won second and third prizes, respectively.
The winners of the ideas and applications tender of Open Data Euskadi were also announced. The winners include a smart assistant for energy saving and an app to locate free parking spaces.
Aragon Open Data, the open data portal of the Government of Aragon, celebrated its tenth anniversary with a face-to-face datathon to prototype services that help people through portal data. The award for the most innovative solution with the greatest impact went to Certifica-Tec, a website that allows you to geographically view the status of energy efficiency certificates.
The Biscay Open Data Datathon set out to transform Biscay based on its open data. At the end of November, the final event of the Datathon was held. The winner was Argilum, followed by Datoston.
UniversiData launched its first datathon, whose winning projects have just been announced.

In addition, in the last few months other initiatives related with the reuse of data have been announced such as:

Researchers from Technical University of Madrid have carried out a study where they use artificial intelligence algorithms to analyse clinical data on lung cancer patients, scientific publications and open data. The aim is to obtain statistical patterns that allow the treatments to be improved.
The Research Report 2021 that the University of Extremadura has just published was generated automatically from the open data portal. It is a document containing more than 1,200 pages which includes the investigations of all the departments of the centre.
F4map is a 3D map that has been produced thanks to the open data of the OpenStreetMap collaborative community. Hence, and alternating visualisation in 2D and 3D, it offers a detailed view of different cities, buildings and monuments from all around the world.

Dissemination of open data and their use cases through events

One thing autumn has stood out for has been for the staging of events focused on the world of data, many of which were recorded and can be viewed again online. Examples include:

The Ministry of Justice and the University of Salamanca organised the symposium Justice and Law in Data: The role of Data as an enabler and engine for change for the transformation of Justice and Law”. During the event reflections were made on data as a public asset. All the presentations are available on the Youtube channel of the University.
In October Madrid hosted a new edition of the Data Management Summit Spain. The day before there was a prior session, organised in collaboration with DAMA España and the Data Office, aimed exclusively at representatives of the public administration and focused on open data and the exchange of information between administrations. This can be seen on Youtube too.
The Barcelona Provincial Council, the Castellon Provincial Council and the Government of Aragon organised the National Open Data Meeting, with the aim of making clear the importance of the latter in territorial cohesion.
The Iberian Conference on Spatial Data Infrastructure was held in Seville, where geographic information trends were discussed.
A recording of the Associationism Seminars 2030, organised by the Government of the Canary Islands, can also be viewed. As regards the presentations, we would highlight the one related with the ‘Map of Associationism in the Canary Islands' which makes this type of data visible in an interactive way.
ASEDIE organised the 14th edition of its International Conference on the reuse of public sector Information which featured various round tables, including one on 'The Data Economy: rights, obligations, opportunities and barriers'.

Guides and courses

During these months, guides have also been published which seek to help publishers and reusers in their work with open data. From datos.gob.es we have published documents on How to prepare a Plan of measures to promote the opening and reuse of open data, the guide to Introduction to data anonymisation: Techniques and practical cases and the Practical guide for improving the quality of open data. In addition, other organisations have also published help documents such as:

The Regional Government of Valencia has published a guide that compiles transparency obligations established by the Valencian law for public sector entities.
The Spanish Data Protection Agency (AEPD) has translated the Singapore Data Protection Authority’s Guide to Basic Anonymisation, in view of its educational value and special interest to data protection officers. The guide is complemented by a free data anonymisation tool, which the AEPD makes available to organisations
The NETWORK of Local Entities for Transparency and Citizen Participation of the FEMP has just presented the Data visualisation guide for Local Entities, a document with good practices and recommendations. The document refers to a previous work of the City Council of L'Hospitalet.

International news

During this period, we have also seen developments at European level. Some of the ones we are highlighting are:

In October the final of the EUdatathon 2022. The finalist teams were previously selected from a total of 156 initial proposals.
The European Data Portal has launched the initiative Use Case Observatory to measure the impact of open data by monitoring 30 use cases over 3 years.
A group of scientists from the Dutch Institute for Fundamental Energy Research has created a database of 31,618 molecules thanks to algorithms trained with artificial intelligence.
The World Bank has developed a new food and nutrition security dashboard which offers the latest global and national data.

These are just a few examples of what the open data ecosystem has produced in recent months. If you would like to share with us any other news, leave us a comment or send us an e-mail to dinamizacion@datos.gob.es

19/12/2022

Use Case Observatory, a European Open Data Portal initiative to measure the impact of open data

Noticia

Measuring the impact of open data is one of the challenges facing open data initiatives. Ther are a variety of methods, most of which combine quantitative and qualitative analysis in order to understand the value of specific datasets.

In this context, data.europa.eu, the European Open Data Portal, has launched a Use Case Observatory. This is a research project on the economic, governmental, social and environmental impact of open data.

What is the Use Case Observatory?

For three years, from 2022 to 2025, the European Data Portal will monitor 30 cases of re-use of open data. The aim is to:

Assess how the impact of open data is created.
Share the challenges and achievements of the analysed re-use cases
Contribute to the debate on the methodology to be used to measure such impact.

The analysed use cases refer to four areas of impact:

Economic impact: includes reuse cases related to business creation and (re)training of workers, among others. For example, solutions that help identify public tenders or apply for jobs are included.
Governmental impact: This refers to reuse cases that drive e-government, transparency and accountability.
Social impact: includes cases of re-use in the fields of healthcare, welfare and tackling inequality.
Environmental impact: This is limited to cases of re-use that promote sustainability and energy reduction, including solutions related to air quality control or forest preservation.

To select the use cases, an inventory was made based on three sources: the examples collected in the maturity studies carried out each year by the European portal, the solutions participating in the EU Datathon and the examples of reuse available in the repository of use cases on data.europa.eu. Only projects developed in Europe were taken into account, trying to maintain a balance between the different countries. In addition, projects that had won an award or were aligned with the European Commission's priorities for 2019 to 2024 were highlighted. To finalise the selection process, data.europa.eu conducted interviews with representatives of the use cases that met the requirements and were interested in participating in the project.

Three Spanish projects among the use cases analysed

The selected use cases are shown in the following image:

Among them, there are three Spaniards:

In the Social Impact category is UniversiDATA-Lab, a public portal for the advanced and automatic analysis of datasets published by universities. This project, which won the first prize in the III Desafío Aporta, was conceived by the team that created UniversiData, a collaborative initiative oriented and driven by public universities with the aim of promoting open data in the higher education sector in Spain in a harmonised way. You can learn more about these projects in this interview.
In the same category we also find Tangible data, a project focused on the creation of sculptures based on data, to bring them closer to non-technical people. Among other data sources, it uses datasets from NASA or Our World in Data.
In the environment category is Planttes. This is a citizen science project designed to report on the presence of allergenic plants in our environment and the level of allergy risk depending on their condition. This project is promoted by the Aerobiological Information Point (PIA) of the Institute of Environmental Science and Technology (ICTA-UAB) and the Department of Animal Biology, Plant Biology and Ecology (BABVE), in collaboration with the Computer Vision Centre (CVC) and the Library Living Lab, all of them at the Autonomous University of Barcelona (UAB).

First report now available

As a result of the analysis carried out, three reports will be developed. The first report, which has just been published, presents the methodology and the 30 selected cases of re-use. It includes information on the services they offer, the (open) data they use and their impact at the time of writing. The report ends with a summary of the general conclusions and lessons learned from this first part of the research project, giving an overview of the next steps of the observatory.

The second and third reports, to be released in 2024 and 2025, will assess the progress of the same use cases and expand on the findings of this first volume. The reports will focus on identifying achievements and challenges over a three-year period, allowing concrete ideas to be extrapolated to improve methodologies for assessing the impact of open data.

The project was presented in a webinar on 7 October, a recording of which is available, together with the presentation used. Representatives from 4 of the use cases were invited to participate in the webinar: Openpolis, Integreat, ANP, and OpenFoodFacts.

07/12/2022

The main justice and society datasets from datos.gob.es

Noticia

Data science has a key role to play in building a more equitable, fair and inclusive world. Open data related to justice and society can serve as the basis for the development of technological solutions that drive a legal system that is not only more transparent, but also more efficient, helping lawyers to do their work in a more agile and accurate way. This is what is known as LegalTech, and includes tools that make it possible to locate information in large volumes of legal texts, perform predictive analyses or resolve legal disputes easily, among other things.

In addition, this type of data drives the development of solutions aimed at responding to the great social challenges facing humanity, helping to promote the common good, such as the inclusion of certain groups, aid for refugees and people in conflict zones or the fight against gender-based violence.

When we talk about open data related to justice and society, we refer both to legal data and to other data that can have an impact on universalising access to basic services, achieving equity, ensuring that all people have the same opportunities for development and promoting collaboration between different social agents.

What types of data on justice and society can I find in datos.gob.es?

On our portal you can access a wide catalogue of data that is classified by different sectors. The Legislation and Justice category currently has more than 5,000 datasets of different types, including information related to criminal offences, appeals or victims of certain crimes, among others. For its part, the Society and Welfare category has more than 8,000 datasets, including, for example, lists of aid, associations or information on unemployment.

Of all these datasets, here are some examples of the most outstanding ones, together with the format in which you can consult them:

At state level

Spanish Statistical Office (INE). Offences according to sex by Autonomous Communities and cities. CSV, XLSX, XLS, JSON, PC-Axis, HTML (landing page for data download)
Spanish Statistical Office (INE). 2030 Agenda SDG - Population at risk of poverty or social exclusion: AROPE indicator. CSV, XLS, XLSX, HTML (landing page for data download)
Spanish Statistical Office (INE). Internet use by demographic characteristics and frequency of use. CSV, XLSX, XLS, JSON, PC-Axis, HTML (landing page for data download)
Spanish Statistical Office (INE). Average expenditure according to size of the municipality of residence. CSV, XLSX, XLS, JSON, PC-Axis, HTML (landing page for data download)
Spanish Statistical Office (INE). Retirement age in access to Benefit. CSV, XLSX
Ministry of Justice. Judicial Census. XLSX, PDF, HTML (landing page for data download

At Autonomous Community level

Cantabrian Institute of Statistics. Statistics on annulments, separations and divorces. RDF-XML, XLS, JSON, ZIP, PC-Axis, HTML (landing page for data download).
Basque Government. Standards and laws in force applicable in the Basque Country. JSON, JSON-P, XML, XLSX.
Basque Government. Locating mass graves from the Civil War and Francoism. CSV, XLS, XML.
Generalitat Catalana. Minstry of Justice resources statistics. XLSX, HTML (landing page for data download).
Government of Catalonia. Youth justice statistics. XLSX, HTML (landing page de descarga de datos).
Autonomous Community of Navarre. Statistics on Transfer of Property Rights. XLSX, HTML (landing page for data download).
Principality of Asturias. Sustainable Development Goals indicators in Asturias. HTML, XLSX, ZIP.
Principality of Asturias. Justice in Asturias: staffing levels of the judicial bodies of the Principality of Asturias according to type. HTML (landing page for data download).
Cantabrian Institute of Statistics. Judges and magistrates active in the Canary Islands. HTML, JSON, PC-Axis.

A the local level

Santa Cruz de Tenerife City Council. Parking spaces for people with reduced mobility. SHP, KML, KMZ, RDF-XML, CSV, JSON, XLS
Madrid City Council. Justice Administration Offices in the city of Madrid. CSV, XML, RSS, RDF-XML, JSON, HTML (landing page for data download)
Gijón City Council. Security forces. JSON, CSV, XLS, PDF, HTML, TSV, texto, XML, HTML (landing page for data download)
Madrid City Council. Child and Family Care Centres. CSV, JSON, RDF-XML, XML, RSS, HTML (landing page for data download).
Zaragoza City Council. List of police stations. CSV, JSON.

Some examples of re-use of justice and social good related data

In the companies and applications section of datos.gob.es you can find some examples of solutions developed with open data related to justice and social good. One example is Papelea, a company that provides answers to users' legal and administrative questions. To this end, it draws on public information such as administrative procedures of the main administrations, legal regulations, jurisprudence, etc. Another example is the ISEAK Foundation, which specialises in the evaluation of public policies on employment, inequality, inclusion and gender, using public data sources such as the National Institute of Statistics, Social Security, Eurostat and Opendata Euskadi.

Internationally, there are also examples of initiatives created to monitor procedural cases or improve the transparency of police services. In Europe, there is a boom in the creation of companies focused on legal technology that seek to improve the daily life of citizens, as well as initiatives that seek to use data for equity. Concrete examples of solutions in this area are miHub for asylum seekers and refugees in Cyprus, or Surviving in Brussels, a website for the homeless and people in need of access to services such as medical help, housing, job offers, legal help or financial advice.

Do you know of a company that uses this kind of data or an application that relies on it to contribute to the advancement of society? Then do not hesitate to leave us a comment with all the information or send us an email to dinamizacion@datos.gob.es.

05/12/2022

Open data as a tool for education and training

Blog

The demand for professionals with skills related to data analytics continues to grow and it is already estimated that just the industry in Spain would need more than 90,000 data and artificial intelligence professionals to boost the economy. Training professionals who can fill this gap is a major challenge. Even large technology companies such as Google, Amazon or Microsoft are proposing specialised training programmes in parallel to those proposed by the formal education system. And in this context, open data plays a very relevant role in the practical training of these professionals, as open data is often the only possibility to carry out real exercises and not just simulated ones.

Moreover, although there is not yet a solid body of research on the subject, some studies already suggest positive effects derived from the use of open data as a tool in the teaching-learning process of any subject, not only those related to data analytics. Some European countries have already recognised this potential and have developed pilot projects to determine how best to introduce open data into the school curriculum.

In this sense, open data can be used as a tool for education and training in several ways. For example, open data can be used to develop new teaching and learning materials, to create real-world data-based projects for students or to support research on effective pedagogical approaches. In addition, open data can be used to create opportunities for collaboration between educators, students and researchers to share best practices and collaborate on solutions to common challenges.

Projects based on real-world data

A key contribution of open data is its authenticity, as it is a representation of the enormous complexity and even flaws of the real world as opposed to artificial constructs or textbook examples that are based on much simpler assumptions.

An interesting example in this regard is documented by Simon Fraser University in Canada in their Masters in Publishing where most of their students come from non-STEM university programmes and therefore had limited data handling skills. The project is available as an open educational resource on the OER Commons platform and aims to help students understand that metrics and measurement are important strategic tools for understanding the world around us.

By working with real-world data, students can develop story-building and research skills, and can apply analytical and collaborative skills in using data to solve real-world problems. The case study conducted with the first edition of this open data-based OER is documented in the book "Open Data as Open Educational Resources - Case studies of emerging practice". It shows that the opportunity to work with data pertaining to their field of study was essential to keep students engaged in the project. However, it was dealing with the messiness of 'real world' data that allowed them to gain valuable learning and new practical skills.

Development of new learning materials

Open datasets have a great potential to be used in the development of open educational resources (OER), which are free digital teaching, learning and research materials, as they are published under an open licence (Creative Commons) that allows their use, adaptation and redistribution for non-commercial uses according to UNESCO's definition.

In this context, although open data are not always OER, we can say that they become OER when are used in pedagogical contexts. Open data used as an educational resource facilitates students to learn and experiment by working with the same datasets used by researchers, governments and civil society. It is a key component for students to develop analytical, statistical, scientific and critical thinking skills.

It is difficult to estimate the current presence of open data as part of OER but it is not difficult to find interesting examples within the main open educational resource platforms. On the Procomún platform we can find interesting examples such as Learning Geography through the evolution of agrarian landscapes in Spain, which builds a Webmap for learning about agrarian landscapes in Spain on the ArcGIS Online platform of the Complutense University of Madrid. The educational resource uses specific examples from different autonomous communities using photographs or geolocated still images and its own data integrated with open data. In this way, students work on the concepts not through a mere text description but with interactive resources that also favour the improvement of their digital and spatial competences.

On the OER Commons platform, for example, we find the resource "From open data to civic engagement", which is aimed at audiences from secondary school upwards, with the objective of teaching them to interpret how public money is spent in a given regional, local area or neighbourhood. It is based on the well-known projects to analyse public budgets "Where do my taxes go?", available in many parts of the world as a result of the transparency policies of public authorities. This resource could be easily ported to Spain, as there are numerous "Where do my taxes go?" projects, such as the one maintained by Fundación Civio.

Data-related skills

When we refer to training and education in data-related skills, we are actually referring to a very broad area that is also very difficult to master in all its facets. In fact, it is common for data-related projects to be tackled in teams where each member has a specialised role in one of these areas. For example, it is common to distinguish at least data cleaning and preparation, data modelling and data visualisation as the main activities performed in a data science and artificial intelligence project.

In all cases, the use of open data is widely adopted as a central resource in the projects proposed for the acquisition of any of these skills. The well-known data science community Kaggle organises competitions based on open datasets contributed to the community and which are an essential resource for real project-based learning for those who want to acquire data-related skills. There are other subscription-based proposals such as Dataquest or ProjectPro but in all cases they use real datasets from multiple general open data repositories or knowledge area specific repositories.

Open data, as in other areas, has not yet developed its full potential as a tool for education and training. However, as can be seen in the programme of the latest edition of the OER Conference 2022, there are an increasing number of examples of open data playing a central role in teaching, new educational practices and the creation of new educational resources for all kinds of subjects, concepts and skills

Content written by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization.

The contents and views reflected in this publication are the sole responsibility of the author.

28/11/2022

Meet the winners of EU Datathon 2022

Noticia

On 20 October, the EU's open data competition came to an end after several months of competition. The final of this sixth edition of the EU Datathon was held in Brussels in the framework of the European Year of Youth and was streamed worldwide.

It is a competition that gives open data enthusiasts and application developers from around the world the opportunity to demonstrate the potential of open data, while their innovative ideas gain international visibility and compete for a portion of the total prize money of €200,000.

The finalist teams were pre-selected from a total of 156 initial submissions. They came from 38 different countries, the largest participation in the history of the competition, to compete in four different categories related to the challenges facing Europe today.

Before the final, the selected participants had the opportunity to present in video format each of the proposals they have been developing based on the open data from the European catalogues.

Here is a breakdown of the winning teams in each challenge, the content of the proposal and the amount of the prize.

Winners of the “European Green Deal” Challenge

The European Green Deal is the blueprint for a modern, sustainable and competitive European economy. Participants who took up the challenge had to develop applications or services aimed at creating a green Europe, capable of driving resource efficiency.

1st prize: CROZ RenEUwable (Croatia)

The application developed by this Croatian team, "renEUwable", combines the analysis of environmental, social and economic data to provide specific and personal recommendations on sustainable energy use.

Prize: €25,000

2nd prize: MyBioEUBuddy (France, Montenegro)

This project was created to help farm workers and local governments find regions that grow organic produce and can serve as an example to build a more sustainable agricultural network.

Prize: €15,000

3rd prize: Green Land Dashboard for Cities (Italy)

The bronze in this category went to an Italian project that aims to analyse and visualise the evolution of green spaces in order to help cities, regional governments and non-governmental organisations to make them more liveable and sustainable.

Prize: €7,000

"Winners of the “Transparency in Public Procurement” Challenge

Transparency in public procurement helps to track how money is spent, combat fraud and analyse economic and market trends. Participants who chose this challenge had to explore the information available to develop an application to improve transparency.

1st prize: Free Software Foundation Europe e.V (Germany)

This team of developers aims to make the links between the private sector, public administrations, users and tenders accessible.

Prize: €25,000

2nd prize: The AI-Team (Germany)

This is a project that proposes to visualise data from TED, the European public procurement journal, in a graphical database and combine them with ownership information and a list of sanctioned entities. This will allow public officials and competitors to trace the amounts and values of contracts awarded back to the owners of the companies.

Prize: €15,000

3rd prize: EMMA (France)

This fraud prevention and early detection tool allows public institutions, journalists and civil society to automatically monitor how the relationship between companies and administration is established at the beginning of a public procurement process.

Prize: €7,000

Winners of the “Public Procurement Opportunities for Young People” Challenge

Public procurement is often perceived as a complex field, where only specialists feel comfortable finding the information they need. Thus, the developers who participated in this challenge had to design, for example, apps aimed at helping young people find the information they need to apply for public procurement positions.

1st prize: Hermix (Belgium, Romania)

It is a tool that develops a strategic marketing methodology aimed at the B2G (business to government) sector so that it is possible to automate the creation and monitoring of strategies for this sector.

Prize: €25,000

2nd prize: YouthPOP (France)

YouthPOP is a tool designed to democratise employment and public procurement opportunities to bring them closer to young workers and entrepreneurs. It does this by combining historical data with machine learning technology.

Prize: €15,000

3rd prize: HasPopEU (Romania)

This proposal takes advantage of open EU public procurement data and machine learning techniques to improve the communication of the skills required to access this type of job vacancies. The application focuses on young people, immigrants and SMEs.

Prize: €7,000

Winners of the “A Europe Fit for the Digital Age” Challenge

The EU aims for a digital transformation that works for people and businesses. Therefore, participants in this challenge developed applications and services aimed at improving data skills, connectivity or data dissemination, always based on the European Data Strategy.

1st prize:: Lobium/Gavagai (Netherlands, Sweden, United Kingdom)

This application, developed using natural language processing techniques, was created with the aim of facilitating the work of investigative journalists, promoting transparency and rapid access to certain information.

Prize: €25,000

2nd prize: 100 Europeans (France)

It is an interactive app that uses open data to raise awareness of the great challenges of our time. In this way, and aware of how difficult it is to communicate the impact that these challenges have on society, '100 Europeans' changes the way of conveying the message and personalises the effects of climate change, pollution or overweight in a total of one hundred people. The aim of this project is to make society more aware of these challenges by telling them through the stories of people close to them.

Prize: €15,000

3rd prize: UNIOR NLP (Italy)

Leveraging European natural language processing techniques and data collection, the Computational Linguistics and Automatic Natural Language Processing research group at the University of Naples L'Orientale has developed a personal assistant called Maggie that guides users to explore cultural content across Europe, answering their questions and offering personalised suggestions.

Prize: €7,000

Finally, the Audience Award of this 2022 edition also went to CROZ RenEUwable, the same team that won the first prize in the challenge dedicated to fostering commitment to the European Green Pact.

photo of the winners of the EU datathon 2022

As in previous editions, the EU Datathon is a competition organised by the Publications Office of the European Union in collaboration with the European Data Strategy. Thus, the recently closed 2022 edition has managed to activate the support of some twenty partners representing open data stakeholders inside and outside the European institutions.

02/11/2022

Helpvoice!, Hospital Morbidity Survey and RIAN win the IV Aporta Challenge

Noticia

The IV edition of the Aporta Challenge, whose motto has revolved around 'The value of data for health and well-being of citizens', has already announced its three winners. The competition, promoted by Red.es in collaboration with the Secretary of State for Digitalisation and Artificial Intelligence, launched in November 2021 with an ideas competition and continued earlier this summer with a selection of ten finalist proposals.

As in the three previous editions, the selected candidates had a three month period to transform their ideas into a prototype, which they presented in person at the final gala.

In a post-pandemic context, where health plays an increasingly important role, the theme of the competition sought to identify, recognise and reward ideas aimed at improving the efficiency of this sector with solutions based on the use of open data.

On 18 October, the ten finalists came to the Red.es headquarters to present their proposals to a jury made up of representatives from public administrations, organisations linked to the digital economy, universities and data communities. In just twelve minutes, they had to summarise the purpose of the proposed project or service, explain how the development process had been carried out, what data they had used, and dwell on aspects such as the economic viability or traceability of the project or service.

Ten innovative projects to improve the health sector

The ten proposals presented to the jury showed a high level of innovation, creativity, rigour and public vocation. They were also able to demonstrate that it is possible to improve the quality of life of citizens by creating initiatives that monitor air quality, build solutions to climate change or provide a quicker response to a sudden health problem, among other examples.

For all these reasons, it is not surprising that the jury had a difficult time choosing the three winners of this fourth edition. In the end, HelpVoice initiative won the first prize of €5,000, the Hospital Morbidity Survey won the €4,000 linked to second place and RIAN, the Intelligent Activity and Nutrition Recommender, closed the ranking with third place and €3,000 as an award.

Winners of the IV APORTA Challenge: The value of data for health and well-being of citizens. First Prize: HelpVoice! by Data Express, composed of Sandra García, Antonio Ríos and Alberto Berenguer. Second prize: The Hospital Morbidity Survey by Marc Coca. Third prize: RIAN - Intelligent Activity and Nutrition Recommender by RIAN Open Data Team, composed of Jesús Noguera and Raúl Micharet.

First prize: HelpVoice!

Team: Data Express, composed of Sandra García, Antonio Ríos and Alberto Berenguer.

HelpVoice! is a service that helps our elderly through voice recognition techniques based on automatic learning. Thus, in an emergency situation, the user only need to click on a device that can be an emergency button, a mobile phone or home automation tools and tell about their symptoms. The system will send a report with the transcript and predictions to the nearest hospital, speeding up the response of the healthcare workers.

In parallel, HelpVoice! will also recommend to the patient what to do while waiting for the emergency services. Regarding the use of data, the Data Express team has used open information such as the map of hospitals in Spain and uses speech and sentiment recognition data in text.

Second prize: The Hospital Morbidity Survey

Team: Marc Coca Moreno

This is a web environment based on MERN, Python and Pentaho tools for the analysis and interactive visualisation of the Hospital Morbidity Survey microdata. The entire project has been developed with open source and free tools and both the code and the final product will be openly accessible.

To be precise, it offers 3 main analyses with the aim of improving health planning:

Descriptive: hospital discharge counts and time series.
KPIs: standardised rates and indicators for comparison and benchmarking of provinces and communities.
Flows: count and analysis of discharges from a hospital region and patient origin.

All data can be filtered according to the variables of the dataset (age, sex, diagnoses, circumstance of admission and discharge, etc.).

In this case, in addition to the microdata from the INE Hospital Morbidity Survey, statistics from the Continuous Register (also from the INE), data from the ICD10 diagnosis catalogues of the Ministry of Health and from the catalogues and indicators of the Agency for Healthcare Research and Quality (AHRQ) and of the Autonomous Communities, such as Catalonia: catalogues and stratification tools, have also been integrated.

You can see the result of this work here.

Third prize: RIAN - Intelligent Activity and Nutrition Recommender

Team: RIAN Open Data Team, composed of Jesús Noguera y Raúl Micharet..

This project was created to promote healthy habits and combat overweight, obesity, sedentary lifestyles and poor nutrition among children and adolescents. It is an application designed for mobile devices that uses gamification techniques, as well as augmented reality and artificial intelligence algorithms to make recommendations.

Users have to solve personalised challenges, individually or collectively, linked to nutritional aspects and physical activities, such as gymkhanas or games in public green spaces.

In relation to the use of open data, the pilot uses data related to green areas, points of interest, greenways, activities and events belonging to the cities of Malaga, Madrid, Zaragoza and Barcelona. In addition, these data are combined with nutritional recommendations (food data and nutritional values and branded food products) and data for food recognition by images from Tensorflow or Kaggle, among others.

Alberto Martínez Lacambra, Director General of Red.es presents the awards and announces a new edition

The three winners were announced by Alberto Martínez Lacambra, Director General of Red.es, at a ceremony held at Red.es headquarters on 27 October. The event was attended by several members of the jury, who were able to talk to the three winning teams.

Photo of the winners of the Desafío Aporta with Alberto Martínez LaCambra, Director General of Red.es, and several members of the jury.

Martínez Lacambra also announced that Red.es is already working to shape the V Aporta Challenge, which will focus on the value of data for the improvement of the common good, justice, equality and equity.

Once again this year, the Aporta Initiative would like to congratulate the three winners, as well as to thank the work and talent of all the participants who decided to invest their time and knowledge in thinking and developing proposals for the fourth edition of the Aporta Challenge.

27/10/2022

casos de uso

1. Introduction

2. Objective

3. Resources

3.1. Datasets

3.2. Tools

4. Data processing or preparation

5. Visualizations

5.1. Line charts

5.2. Bar charts

5.3. Histograms

5.4. Box and whisker diagrams

5.5. Pie charts

5.6. Scatter plots

5.7. Correlation matrix

6. Conclusions of the exercise

1. Introduction

2. Objective

3. Resources

3.1. Datasets

3.2. Tools

4. Data processing and preparation

5. Data enrichment

Step 1

Step 2

Step 3

Step 5

6. Map visualization

6.1 Map generation with "Google My Maps"

6.2 Personalization of the information to be displayed on the map

6.3 Map functionalities (layers, pins, routes and immersive 3D view)

Pins

Routes

3D immersive map

7. Conclusions of the exercise

1. Introduction

2. Objetive

3. Resources

3.1. Datasets

3.2. Tools

4. Data processing or preparation

5. Generating visualizations

5.1. Fleet of vehicles registered by Autonomous Communities

5.2. Ratio of positive toxicological results for different types of vehicles

5.3. Ratio of positive toxicological results for the Autonomous Communities

6. Conclusions of the study

New functionalities and partnerships

Open data tenders, a showcase for finding talent and new use cases

Dissemination of open data and their use cases through events

Guides and courses

International news

What is the Use Case Observatory?

Three Spanish projects among the use cases analysed

First report now available

What types of data on justice and society can I find in datos.gob.es?

A the local level

Some examples of re-use of justice and social good related data

Projects based on real-world data

Development of new learning materials

Data-related skills

Winners of the “European Green Deal” Challenge

1st prize: CROZ RenEUwable (Croatia)

2nd prize: MyBioEUBuddy (France, Montenegro)

3rd prize: Green Land Dashboard for Cities (Italy)

"Winners of the “Transparency in Public Procurement” Challenge

1st prize: Free Software Foundation Europe e.V (Germany)

2nd prize: The AI-Team (Germany)

3rd prize: EMMA (France)

Winners of the “Public Procurement Opportunities for Young People” Challenge

1st prize: Hermix (Belgium, Romania)

2nd prize: YouthPOP (France)

3rd prize: HasPopEU (Romania)

Winners of the “A Europe Fit for the Digital Age” Challenge

1st prize:: Lobium/Gavagai (Netherlands, Sweden, United Kingdom)

2nd prize: 100 Europeans (France)

3rd prize: UNIOR NLP (Italy)

Ten innovative projects to improve the health sector

First prize: HelpVoice!

Second prize: The Hospital Morbidity Survey

Third prize: RIAN - Intelligent Activity and Nutrition Recommender