Blog

The demand for professionals with data analytics skills continues to grow, and it is already estimated that industry in Spain alone will need more than 90,000 data and artificial intelligence professionals to boost the economy. Training professionals who can fill this gap is a major challenge. Even large technology companies such as Google, Amazon and Microsoft are offering specialised training programmes in parallel to those of the formal education system. In this context, open data plays a very relevant role in the practical training of these professionals, as it is often the only way to carry out real exercises rather than merely simulated ones.

Moreover, although there is not yet a solid body of research on the subject, some studies already suggest positive effects derived from the use of open data as a tool in the teaching-learning process of any subject, not only those related to data analytics. Some European countries have already recognised this potential and have developed pilot projects to determine how best to introduce open data into the school curriculum.

In this sense, open data can be used as a tool for education and training in several ways. For example, open data can be used to develop new teaching and learning materials, to create real-world data-based projects for students or to support research on effective pedagogical approaches. In addition, open data can be used to create opportunities for collaboration between educators, students and researchers to share best practices and collaborate on solutions to common challenges.

Projects based on real-world data

A key contribution of open data is its authenticity, as it is a representation of the enormous complexity and even flaws of the real world as opposed to artificial constructs or textbook examples that are based on much simpler assumptions.

An interesting example in this regard is documented by Simon Fraser University in Canada in its Master of Publishing programme, where most students come from non-STEM university programmes and therefore have limited data handling skills. The project is available as an open educational resource on the OER Commons platform and aims to help students understand that metrics and measurement are important strategic tools for understanding the world around us.

By working with real-world data, students can develop story-building and research skills, and can apply analytical and collaborative skills in using data to solve real-world problems. The case study conducted with the first edition of this open data-based OER is documented in the book "Open Data as Open Educational Resources - Case studies of emerging practice". It shows that the opportunity to work with data pertaining to their field of study was essential to keep students engaged in the project. However, it was dealing with the messiness of 'real world' data that allowed them to gain valuable learning and new practical skills.

Development of new learning materials

Open datasets have great potential for use in the development of open educational resources (OER), which are free digital teaching, learning and research materials published under an open licence (Creative Commons) that allows their use, adaptation and redistribution for non-commercial purposes, according to UNESCO's definition.

In this context, although open data are not always OER, we can say that they become OER when they are used in pedagogical contexts. Open data used as an educational resource allows students to learn and experiment by working with the same datasets used by researchers, governments and civil society. It is a key component for students to develop analytical, statistical, scientific and critical thinking skills.

It is difficult to estimate the current presence of open data in OER, but it is not difficult to find interesting examples on the main open educational resource platforms. On the Procomún platform we can find examples such as Learning Geography through the evolution of agrarian landscapes in Spain, which builds a web map for learning about agrarian landscapes in Spain on the ArcGIS Online platform of the Complutense University of Madrid. The educational resource uses specific examples from different autonomous communities, combining geolocated photographs and still images and its own data with open data. In this way, students work on the concepts not through a mere text description but with interactive resources that also help improve their digital and spatial competences.

On the OER Commons platform, for example, we find the resource "From open data to civic engagement", which is aimed at audiences from secondary school upwards, with the objective of teaching them to interpret how public money is spent in a given regional, local area or neighbourhood. It is based on the well-known projects to analyse public budgets "Where do my taxes go?", available in many parts of the world as a result of the transparency policies of public authorities. This resource could be easily ported to Spain, as there are numerous "Where do my taxes go?" projects, such as the one maintained by Fundación Civio.

Data-related skills

When we refer to training and education in data-related skills, we are actually referring to a very broad area that is also very difficult to master in all its facets. In fact, it is common for data-related projects to be tackled in teams where each member has a specialised role in one of these areas. For example, it is common to distinguish at least data cleaning and preparation, data modelling and data visualisation as the main activities performed in a data science and artificial intelligence project.

In all cases, open data is widely adopted as a central resource in the projects proposed for acquiring these skills. The well-known data science community Kaggle organises competitions based on open datasets contributed by the community, which are an essential resource for real project-based learning for anyone who wants to acquire data-related skills. There are other subscription-based options, such as Dataquest or ProjectPro, but they all use real datasets from general open data repositories or from repositories specific to a knowledge area.

Open data, as in other areas, has not yet developed its full potential as a tool for education and training. However, as can be seen in the programme of the latest edition of the OER Conference 2022, there is a growing number of examples of open data playing a central role in teaching, in new educational practices and in the creation of new educational resources for all kinds of subjects, concepts and skills.


Content written by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization.

The contents and views reflected in this publication are the sole responsibility of the author.

Event

UniversiDATA organizes its first Datathon.

Are you interested in data analysis? If the answer is yes, you will be interested to know that UniversiDATA has launched a datathon to recognize open data processing projects based on the datasets available on its portal. 
This initiative seeks not only to reward talent, but also to promote the content on offer and encourage its use, inspiring new projects through the dissemination of reference cases. It is also an opportunity to get to know the portal's reusers and better understand their objectives, challenges and needs. This valuable information will be used to drive improvements to the platform.

Who can participate?

The call is open to any natural person of legal age with tax residence in the European Union. Students, developers, researchers or any citizen interested in data analysis can participate.

How does the competition take place?

The competition is structured in two phases:

Elimination phase 

A jury of experts will evaluate all the applications received on the basis of the project descriptions provided by the participants in the registration form. Four criteria detailed in the competition rules will be taken into account: impact, quality of the proposal, usability/presentation and reusability.
After the evaluation, a maximum of 10 finalists will pass to the next phase.

Final phase

The finalists will have to present their projects via videoconference to the members of the jury. 
After the presentation, the three winners will be announced.

What are the requirements for the projects submitted?

The submitted projects must meet three requirements:

  • At least one dataset published on the open data portal UniversiDATA must be used. These datasets may be combined with secondary data from other sources.
  • The use made of the UniversiDATA datasets must be essential to obtain the results of the analysis.
  • The analysis of the data must have a clear purpose or utility. Generic data treatments will not be accepted. 

It should be noted that existing work, previously presented in other contexts, may be submitted. Furthermore, there are no limitations regarding the technology used for the analysis (each participant may use the tools they consider appropriate) or regarding the format of the final result. That is to say, you can compete with a mobile app, a web application, a data analysis in Jupyter or R Markdown, etc.

What are the prizes?

The three winners will receive a total of €3,500, distributed as follows:

  • First prize: €2,000
  • Second prize: €1,000
  • Third prize: €500

The other finalists will receive a diploma of recognition.

How can I participate?

The registration period is now open. The easiest way to participate is to fill out the form available at the end of the competition website before October 14, 2022.
Before registering, we recommend that you review all the complete information in the competition rules. If you have any questions, you can contact the organizers. Various channels will soon be available for participants to ask questions regarding not only the competition procedures, but also the platform's datasets.

Competition schedule

  • Receipt of applications: until October 14, 2022
  • Publication of the 10 finalists: November 14, 2022
  • Communication of the three winners: December 16, 2022

Find out more about UniversiDATA

UniversiDATA was born at the end of 2020 with the aim of promoting open data in the higher education sector in Spain in a harmonized way. It is a public-private initiative that currently encompasses Universidad Rey Juan Carlos, Universidad Complutense de Madrid, Universidad Autónoma de Madrid, Universidad Carlos III de Madrid, Universidad de Valladolid, Universidad de Huelva and "Dimetrical, The Analytics Lab", S.L.
In 2021 UniversiDATA won the First Prize of the III Desafío Aporta for its project UniversiDATA-Lab, a public portal for the advanced and automatic analysis of datasets published by universities. You can learn more about the project in this interview.

Reusing company

Estudio Alfa is a technology company dedicated to offering services that promote the image of companies and brands on the Internet, including the development of apps. To carry out these services, they use techniques and strategies that comply with usability standards and favour positioning in search engines, helping their clients' websites receive more visitors and, with them, potential customers. They also have particular experience in the production and tourism sectors.

 

Blog

Spanish universities hold a wealth of quality data with great economic and social potential. For this reason, for some time now we have been witnessing a data-opening movement among universities in our country, aimed at promoting the use and reuse of the information they generate and at improving their transparency, among other goals.

As a result of this openness movement, open data portals linked to universities have emerged, many of which are federated with datos.gob.es, such as those of the University of Cantabria, the University of Extremadura or Pablo de Olavide University, although there are many more. There have even been collaborative initiatives such as UniversiDATA, which seeks to promote open data in the higher education sector in a harmonized way. UniversiDATA currently brings together 6 universities, but it intends to continue growing and to act as a single access point where the different universities can share their data in a simple way. Also noteworthy is the role of the Ministry of Universities, which currently offers some 1,000 datasets through datos.gob.es.

Among the datasets shared by universities we find administrative data, such as student profiles or the different academic itineraries, although universities also play a relevant role in publishing the data of their research groups.

But the role of universities in the open data ecosystem goes far beyond that of mere publishers: they also play a prominent role in promoting the opening of more data and its reuse.

Examples of open data groups linked to universities

The field of data in general, and open data in particular, carries more and more weight within the university environment, both in academic matters and in the area of dissemination. As a consequence, today we find various examples of groups, chairs and professionals linked to higher education who focus on promoting the use and reuse of academic data.

Below, we analyze some open data groups linked to higher education in our country:

Chair of Transparency and Open Data – Valencian Observatory of Open Data and Transparency

What is it and what is its purpose?

This organization has been created by the Valencian Generalitat and the Polytechnic University of Valencia, through its Higher School of Computer Engineering and the University Master's Degree in Information Management, to carry out activities related to transparency, participation and citizen access to public information in the Valencian Community.

What functions does it carry out?

The chair organizes and develops a series of activities related to the field of transparency and open data that can be classified into three different categories:

  • Training activities: The chair collaborates in teaching activities, such as the Official University Master's Degree in Information Management; in the implementation of permanent training programs aimed at officials, local entities and associations; and in the creation of scholarships and prizes for the best final degree projects on transparency by UPV students.
  • Technical, technological and artistic dissemination activities: Through workshops, seminars and conferences, the chair seeks to promote both the publication and the reuse of open data. It also develops working spaces together with the administration's technical staff, computer programmers and public managers.
  • Research and development activities: The chair works on different lines of research, especially in the field of open data from the Generalitat's GVAObert website. It also promotes innovative technological solutions for transparency, participation and access to public information.

Next steps

In addition to the Datathon, they also plan to hold an international conference with the collaboration of Open Knowledge Foundation, a series of open data-based news verification workshops together with Newtral, various dissemination and training sessions with several business federations, and a workshop on open data from a gender perspective. Their plans also include creating a directory of reusers in the Valencian Community.

Although work is underway on them, these events do not yet have an official agenda or calendar.

Cajasiete Chair of Big Data, Open Data and Blockchain, University of La Laguna

What is it and what is its purpose?

The University of La Laguna and Cajasiete created the Big Data, Open Data and Blockchain Chair last year with the purpose of contributing to digital transformation and the development of a competitive economy in the Canary Islands through research, training and dissemination in fields such as Big Data, Open Data and Blockchain. The Chair's Advisory Council includes representatives of public and private entities of the archipelago, which are key players in promoting digital transformation and the development of public-private business models in the Canary Islands.

What functions does it carry out?

One of its most outstanding projects has been "Mapa BOB", in which it analyzed in detail the situation of Canarian entities in terms of digitization in order to take concrete actions that help fulfill its mission. After this analysis, the chair has focused on various initiatives aimed at raising the level of digitization in the archipelago, many of them focused on open data:

  • BOB Webinars: They have developed online seminars on specific topics to learn about and disseminate good practices in the field of technology and data.
  • Meetings of Public Administrations: They have also set up spaces so that different public administrations in the Canary Islands can share experiences on specific data-related issues. The last face-to-face edition included a round table dedicated to the analysis of open data.
  • Projects: Among the projects they have carried out, "Practical application of open data: update and prediction" stands out. This is a final degree project whose objective was to define a quality indicator and, based on open data related to COVID-19, make predictions of the evolution of the cumulative incidence.
  • Training courses: Within the field of open data, the chair has organized different training courses, such as:

- Course on the CKAN platform for the Cabildo de Lanzarote.

- Masterclass on Open Data and Big Data for business students at the University of La Laguna.

- Open Data courses for Public Administrations within the initiative "Elements to modernize the Administration and improve the provision of public services in the current context" of the University of La Laguna.

- Python course, whose final project consisted of analyzing open tourism data on the island of Tenerife.

Next steps

Currently, the BOB Chair is collaborating in the DIGINNOVA initiative, a training program to train unemployed people in digitization matters specific to the needs of the islands. For 2022, various dissemination, training and research activities have been planned, with which they hope to continue growing and adding value, although none of them has been officially presented yet.

Hablando en data

What is it and what is its purpose?

It is a dissemination channel aimed at sharing information and knowledge of interest with the community of data scientists. In this space we can find interviews, concept reviews, news and events on Data Science and Artificial Intelligence. In this case we are not talking about a project linked to a university, but to university professionals. The channel has been created by Abilio Romero Ramos and Iván Robles Agudo, two data scientists with extensive experience in the sector who dedicate part of their professional time to teaching data-related subjects in different master's degrees at higher education centers such as the European University and CEU San Pablo University.

What functions does it carry out?

Their YouTube channel has become the central axis of their dissemination project. Through it they try to explain, in an entertaining and simple way, the various concepts needed to take advantage of the value hidden in data, especially data that is available to the public.

In addition, on their website they centralize and collect a multitude of content related to data science. There we can identify some of the functions that this initiative carries out, such as:

  • Dissemination: They have a section dedicated to publicising blogs, websites and channels that offer information of interest on this subject.
  • Training: Their website also includes a section dedicated to sharing training materials of interest on data science, whether free books, documentaries or online courses.
  • Data science for school: an initiative that tries to bring data to the entire population, especially students. Here they share useful and fun resources to help young people get into the world of data science.

Also noteworthy is the collaboration that this initiative carries out with SoGoodData, an NGO that helps and supports social projects by combining public information with data shared by collaborating companies.

Next steps

The closest future plans for this initiative are to continue disseminating knowledge about data science among its community and to continue taking advantage of the possibilities that this field offers. They will continue to spread the use and reuse of public data, not only from their YouTube channel but also from the masters in which they teach and through initiatives such as SoGoodData.

 

As we have seen, universities and their professionals have acquired a relevant role in the open data sector in our country as publishers, disseminators and also reusers, although that last role is a matter for another article.

Do you know of any other initiative for the dissemination of open data in the university field? Leave us a comment or write to us at dinamizacion@datos.gob.es and tell us about it. We will be happy to hear from you.

Documentation

1. Introduction

Visualizations are graphical representations of data that make it possible to transmit, in a simple and effective way, the information linked to them. The visualization possibilities are very broad, from basic representations such as line, bar or pie charts to visualizations configured on control panels or interactive dashboards. Visualizations play a fundamental role in drawing conclusions from visual information, also making it possible to detect patterns, trends and anomalous data, and to project predictions, among many other functions.

Before building an effective visualization, we need to pre-process the data, paying special attention to how we obtain it and to validating its content, ensuring that it is in an appropriate, consistent and error-free format for processing. Preliminary data treatment is essential for any data analysis task and for producing an effective visualization.

In the "Visualizations step-by-step" section, we periodically present practical exercises on visualizing open data available in the datos.gob.es catalog or other similar catalogs. There we describe in a simple way the steps needed to obtain the data, perform the pertinent transformations and analyses and, finally, create interactive visualizations from which we extract information that is then summarized in final conclusions.

In this practical exercise, we have carried out a simple code development, conveniently documented and relying on free-to-use tools. All the material generated is available for reuse in the GitHub Data Lab repository.

Access the data lab repository on Github.

Run the data pre-processing code on Google Colab.

2. Objectives

The main objective of this post is to learn how to make an interactive visualization based on open data. For this practical exercise we have chosen datasets containing relevant information about students in Spanish universities over the last few years. From these data we will observe the characteristics of Spanish university students and which studies are most in demand.

3. Resources

3.1. Datasets

For this practical case, datasets published by the Ministry of Universities have been selected. The Ministry collects time series of data with different disaggregations that facilitate analysis of the characteristics of Spanish university students. These data are available both in the datos.gob.es catalogue and in the Ministry of Universities' own data catalogue. The specific datasets we will use are:

3.2. Tools

To carry out the pre-processing of the data, the R programming language has been used from the Google Colab cloud service, which allows the execution of Jupyter Notebooks.

Google Colaboratory, also called Google Colab, is a free cloud service from Google Research that allows you to write, execute and share code in Python or R from your browser, so it does not require the installation of any tool or configuration.

For the creation of the interactive visualization the Datawrapper tool has been used. 

Datawrapper is an online tool that allows you to make graphs, maps or tables that can be embedded online or exported as PNG, PDF or SVG. This tool is very simple to use and allows multiple customization options. 

If you want to know more about tools that can help you with data processing and visualization, you can consult the report "Data processing and visualization tools".

4. Data pre-processing

As the first step of the process, it is necessary to perform an exploratory data analysis (EDA) in order to properly interpret the initial data, detect anomalies, missing data or errors that could affect the quality of subsequent processes and results, and carry out the transformation and preparation of the necessary variables. Pre-processing the data is essential to ensure that the analyses or visualizations subsequently created from it are reliable and consistent. If you want to know more about this process, you can consult the Practical Guide to Introduction to Exploratory Data Analysis.

The steps followed in this pre-processing phase are as follows: 

  1. Installation and loading the libraries
  2. Loading source data files
  3. Creating work tables
  4. Renaming some variables
  5. Grouping several variables into a single one with different factors
  6. Variables transformation
  7. Detection and processing of missing data (NAs)
  8. Creating new calculated variables
  9. Summary of transformed tables
  10. Preparing data for visual representation
  11. Storing files with pre-processed data tables 

You will be able to reproduce this analysis, as the source code is available in this GitHub repository. The code is provided through a Jupyter Notebook which, once loaded into the development environment, can be easily executed or modified. Given the informative nature of this post and to make learning easier for non-specialized readers, the code is not intended to be the most efficient but the easiest to understand, so you will probably come up with many ways of optimizing it to achieve a similar result. We encourage you to do so!
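Before diving into the notebook, the following minimal sketch in R illustrates the kind of operations behind steps 2 to 11 of the list above. It assumes a hypothetical CSV file and column names (curso, sexo, nivel, matriculados); the actual file names and variables are those of the repository.

```r
# Minimal sketch of the pre-processing steps (hypothetical file and column names,
# not the actual notebook code)

# 1. Install and load the libraries
library(dplyr)
library(tidyr)
library(readr)

# 2. Load a source data file
enrolments_raw <- read_csv("matriculaciones_nivel_academico.csv")

# 3-4. Create the working table and rename some variables
enrolments <- enrolments_raw %>%
  rename(course = curso, gender = sexo, level = nivel, total = matriculados)

# 5-6. Transform variables: treat gender and academic level as factors
enrolments <- enrolments %>%
  mutate(gender = as.factor(gender), level = as.factor(level))

# 7. Detect and handle missing data (NAs); here incomplete rows are simply dropped
sum(is.na(enrolments))
enrolments <- enrolments %>% drop_na()

# 8. Create a new calculated variable: share of each group within its academic year
enrolments <- enrolments %>%
  group_by(course) %>%
  mutate(share = total / sum(total)) %>%
  ungroup()

# 9. Summary of the transformed table
summary(enrolments)

# 10-11. Store the table that will feed the visualization tool
write_csv(enrolments, "Matriculaciones_NivelAcademico.csv")
```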

You can follow the steps and run the source code on this notebook in Google Colab. 

5. Data visualizations

Once the data is pre-processed, we proceed with the visualization. To create this interactive visualization we use the Datawrapper tool in its free version. It is a very simple tool with special application in data journalism that we encourage you to use. Being an online tool, it is not necessary to have software installed to interact or generate any visualization, but it is necessary that the data table that we provide is properly structured.

To address the process of designing the set of visual representations of the data, the first step is to consider the questions we intend to answer. We propose the following:

  • How has the number of men and women been distributed among bachelor's, master's and PhD students over the last few years?

If we focus on the last academic year 2020-2021:

  • What are the most demanded fields of science in Spanish universities? What about degrees?
  • Which universities have the highest number of enrolments and where are they located?
  • What age ranges do bachelor's students at Spanish universities fall into?
  • What is the nationality of bachelor's students at Spanish universities?

Let's find out by looking at the data!

5.1. Distribution of enrolments in Spanish universities from the 2015-2016 academic year to 2020-2021, disaggregated by gender and academic level

We created this visual representation taking into account bachelor's, master's and PhD enrolments. Once the data table has been uploaded to Datawrapper (the "Matriculaciones_NivelAcademico" dataset), we selected the type of chart to create, in this case a stacked bar chart, in order to reflect, for each academic year and gender, the people enrolled at each academic level. In this way we can also see the total number of students enrolled per year. Next, we selected the variable to represent (Enrolments) and the disaggregation variables (Gender and Course). Once the chart is obtained, we can easily modify its appearance, changing the colors, the description and the information shown on each axis, among other characteristics.
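As a complement, and to illustrate step 10 of the pre-processing ("Preparing data for visual representation"), the following sketch reshapes an enrolments table into a wide layout, with one row per academic year and one column per gender and level combination, a common arrangement for stacked bar charts. It assumes the same hypothetical column names as the earlier sketch, not the actual repository variables.

```r
# Sketch: reshaping the pre-processed enrolments table into a wide layout
# (one row per academic year, one column per gender-level combination).
# File and column names are hypothetical.
library(dplyr)
library(tidyr)
library(readr)

enrolments <- read_csv("Matriculaciones_NivelAcademico.csv")

enrolments_wide <- enrolments %>%
  unite("group", gender, level, sep = " - ") %>%   # e.g. "Women - Bachelor"
  pivot_wider(id_cols = course,
              names_from = group,
              values_from = total,
              values_fill = 0)

write_csv(enrolments_wide, "Matriculaciones_NivelAcademico_wide.csv")
```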

To answer the following questions, we will focus on bachelor's students and the 2020-2021 academic year; however, the following visual representations can be replicated for master's and PhD students, and for the other academic years.

5.2. Map of georeferenced Spanish universities, showing the number of students enrolled in each of them

To create the map, we used a list of georeferenced Spanish universities published by the Open Data Portal of Esri Spain. Once the data for the different geographical areas had been downloaded in GeoJSON format, we converted them to Excel in order to combine the dataset of georeferenced universities with the previously pre-processed dataset containing the number of students enrolled at each university. For this we used Excel's VLOOKUP() function, which allows us to locate certain elements in a range of cells in a table.
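The same combination can also be carried out outside Excel. As an alternative to VLOOKUP(), the sketch below joins the two tables in R by university name; the file and column names are hypothetical, and the actual files are those described above.

```r
# Sketch: joining the georeferenced universities with the enrolment counts in R,
# as an alternative to Excel's VLOOKUP() (hypothetical file and column names)
library(dplyr)
library(readr)

universities_geo <- read_csv("universidades_georreferenciadas.csv")  # Universidad, Latitud, Longitud
enrolments_uni   <- read_csv("matriculaciones_por_universidad.csv")  # Universidad, Matriculados

# A left join keeps every georeferenced university and attaches its enrolment
# figure, which is what VLOOKUP() does cell by cell
universities <- universities_geo %>%
  left_join(enrolments_uni, by = "Universidad")

write_csv(universities, "Universities.csv")  # table to be uploaded to Datawrapper
```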

Before uploading the dataset to Datawrapper, we need to select the layer, provided by the tool itself, that shows the map of Spain divided into provinces. Specifically, we selected the option "Spain >> Provinces (2018)". We then incorporated the previously generated "Universities" dataset (included in the GitHub datasets folder for this step-by-step visualization), indicating which columns contain the values of the Latitude and Longitude variables.

From this point, Datawrapper generates a map showing the location of each university. We can now modify the map according to our preferences and settings. In this case, we make the size and color of the dots depend on the number of enrolments at each university. In addition, for this data to be displayed, we have to indicate, in the "Tooltips" section of the "Annotate" tab, the variables or text we want to appear.

5.3. Ranking of enrolments by degree

For this graphic representation, we use Datawrapper's table visual object (Table) and the "Titulaciones_totales" dataset to show the number of enrolments in each of the degrees available during the 2020-2021 academic year. Since the list of degrees is very long, the tool offers the possibility of including a search box that allows the results to be filtered.

5.4. Distribution of enrolments by field of science

For this visual representation, we used the "Matriculaciones_Rama_Grado" dataset and selected pie charts, in which we represented the number of enrolments by gender in each of the fields of science into which university degrees are divided (Social and Legal Sciences, Health Sciences, Arts and Humanities, Engineering and Architecture, and Sciences). As with the rest of the charts, we can modify the colors, in this case depending on the field of study.

5.5. Bachelor's degree enrolments by age and nationality

For these two visual representations we use bar charts. The first shows the distribution of enrolments disaggregated by gender and nationality, using the "Matriculaciones_Grado_nacionalidad" dataset; the second shows them disaggregated by gender and age, using the "Matriculaciones_Grado_edad" dataset. As with the previous visuals, the tool makes it easy to modify the characteristics of the charts.

6. Conclusions

Data visualization is one of the most powerful mechanisms for exploiting and analyzing the implicit meaning of data, regardless of the type of data and the degree of technological knowledge of the user. Visualizations allow us to extract meaning out of the data and create narratives based on graphical representation. In the set of graphical representations of data that we have just implemented, the following can be observed: 

  • The number of enrolments increases across the academic years regardless of the academic level (bachelor's, master's or PhD).
  • The number of women enrolled is higher than that of men in bachelor's and master's degrees, but lower in PhD enrolments, except in the 2019-2020 academic year.
  • The highest concentration of universities is found in the Community of Madrid, followed by the autonomous community of Catalonia.
  • The university with the highest number of enrolments in the 2020-2021 academic year is the UNED (National University of Distance Education) with 146,208 enrolments, followed by the Complutense University of Madrid with 57,308 and the University of Seville with 52,156.
  • The most in-demand degree in the 2020-2021 academic year is the Degree in Law with 82,552 students nationwide, followed by the Degree in Psychology with 75,738 students and, with hardly any difference, the Degree in Business Administration and Management with 74,284 students.
  • The field of science with the highest concentration of students is Social and Legal Sciences, while the least in demand is Sciences.
  • The nationalities with the most representation in Spanish universities, apart from the Spanish one, are those of the European Union, followed by the countries of Latin America and the Caribbean.
  • The age range between 18 and 21 is the most represented in the student body of Spanish universities.

We hope that this step-by-step visualization has been useful for learning some very common techniques in the treatment and representation of open data. We will return to show you new reuses. See you soon! 

 

Interview

Publishing open data following linked data best practices helps boost its reuse. Data and metadata are described using RDF standards, which make it possible to represent relationships between entities, properties and values. In this way, datasets are interconnected with each other, regardless of the digital repository where they are located, which facilitates their contextualization and exploitation.

If there is a field where this type of data is especially valued, it is research. It is therefore not surprising that more and more universities are beginning to use this technology. This is the case of the University of Extremadura (UEX), which has a research portal that automatically compiles the scientific production linked to the institution. Adolfo Lozano, Director of the Office of Transparency and Open Data at the University of Extremadura and collaborator in the preparation of the "Practical guide for the publication of linked data in RDF", tells us how this project was launched.
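To give a flavour of how such linked data can be queried, the sketch below sends a SPARQL query from R to a research portal endpoint and lists a researcher's publications. The endpoint URL, the vocabulary terms (foaf, Dublin Core) and the researcher's name are illustrative assumptions, not the actual UEX schema.

```r
# Illustrative sketch: querying a SPARQL endpoint from R to list a researcher's
# publications. Endpoint URL, graph structure and property names are assumptions.
library(httr)
library(jsonlite)

endpoint <- "https://opendata.example-university.es/sparql"  # hypothetical endpoint

query <- '
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?publication ?title ?year WHERE {
  ?researcher  foaf:name   "Name Surname" .
  ?publication dct:creator ?researcher ;
               dct:title   ?title ;
               dct:issued  ?year .
}
ORDER BY DESC(?year)
'

response <- GET(endpoint,
                query = list(query = query),
                accept("application/sparql-results+json"))

results  <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
bindings <- results$results$bindings   # one row per publication found
head(bindings)
```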

Full interview:

1. The research portal of the University of Extremadura is a pioneering initiative in our country. How did the project come about?

The research portal of the University of Extremadura was launched about a year ago and has had a magnificent reception among UEX researchers and among external entities looking for the lines of work of our researchers.

But the UEX open data portal initiative started in 2015, applying the knowledge of our research group Quercus of the University of Extremadura on semantic representation, and with the experience we had in the open data portal of the City Council of Cáceres. The biggest boost came about 3 years ago when the Vice-Rectorate for Digital Transformation created the UEX Office of Transparency and Open Data.

From the beginning, we were clear that we wanted a portal with quality data, with the highest level of reusability, and where international standards would be applied. Although it was a considerable effort to publish all the datasets using ontological schemas, always representing the data in RDF, and linking the resources as a usual practice, we can say that in the medium term the benefits of organizing the information in this way give us great potential to extract and manage the information for multiple purposes.

We wanted a portal with quality data, with the highest level of reusability, and where international standards would be applied. [...] it was a considerable effort to publish all the datasets using ontological schemas, always representing the data in RDF, and linking the resources.

2. One of the first steps in a project of this type is to select vocabularies that allow data to be conceptualized and semantic relationships between data to be established. Did you have a good starting point, or did you need to develop a vocabulary expressly for this context? Does the availability of reference vocabularies constitute a brake on the development of data interoperability?

One of the first steps in following ontology schemas in an open data portal is to identify the most appropriate terms to represent the classes, attributes and relationships that will configure the datasets. And it is also a practice that continues as new datasets are incorporated.

In our case, we have tried to reuse the most widespread vocabularies as far as possible, such as foaf, schema and dublin core, and also some specific ones such as vivo or bibo. But in many cases we have had to define our own terms in our ontology because those components did not exist. In our opinion, when the CRUE-TIC Hercules project is operational and the generic ontology schemas for universities are defined, it will greatly improve interoperability between our data and, above all, it will encourage other universities to create their open data portals with these models.

One of the first steps in following ontology schemas in an open data portal is to identify the most appropriate terms to represent the classes, attributes and relationships that will configure the datasets.

3. How did you approach the development of this initiative, what difficulties did you encounter and what profiles are necessary to carry out a project of this type?

In our opinion, if you want to make a portal that is useful in the medium term, it is clear that an initial effort is required to organize the information. Perhaps the most complicated thing at the beginning is to collect the data that are scattered in different services of the University in multiple formats, understand what they consist of, find the best way to represent them, and then coordinate how to access them periodically for updates.

In our case, we have developed specific scripts for different data source formats, from different UEX Services (such as the IT Service, the Transfer Service, or from external publication servers) and that transform them into RDF representation. In this sense, it is essential to have Computer Engineers specialized in semantic representation and with extensive knowledge of RDF and SPARQL. In addition, of course, different services of the University must be involved to coordinate this information maintenance.

4. How do you assess the impact of the initiative? Can you tell us with some success stories of reuse of the provided datasets?

From the query logs, especially those of the research portal, we see that many researchers use the portal as a data collection point for preparing their CVs. In addition, we know that companies that need a specific development use the portal to obtain the profile of our researchers.

But, on the other hand, it is common for some users (inside and outside the UEX) to ask us for specific queries on the portal data. And, curiously, in many cases it is the University's own services, the ones that provide us with the data, that ask us for specific lists or graphs in which their data are linked and cross-referenced with other datasets in the portal.

By having the data linked, a UEX professor is linked to the subject he/she teaches, the area of knowledge, the department, the center, but also to his/her research group, to each of his/her publications, the projects in which he/she participates, the patents, etc. The publications are linked to journals and these in turn with their impact indexes.

On the other hand, the subjects are linked to the degrees where they are taught, the centers, and we also have the number of students enrolled in each subject, and quality and user satisfaction indexes. In this way, complex queries and reports can be made by handling all this information together.

As use cases, for example, we can mention that the Word documents of the 140 quality commissions of the degrees are automatically generated (including annual evolution graphs and lists) by means of queries to the opendata portal. This has saved dozens of hours of joint work for the members of these commissions.

Another example, which we have completed this year, is the annual research report, which has also been generated automatically through SPARQL queries. We are talking about more than 1,500 pages where all the scientific production and transfer of the UEX is exposed, grouped by research institutes, groups, centers and departments.

As use cases, for example, we can mention that the Word documents of the 140 quality commissions of the degrees are automatically generated (including annual evolution graphs and lists) by means of queries to the opendata portal. This has saved dozens of hours of joint work for the members of these commissions.

5. What are the future plans of the University of Extremadura in terms of open data?

Much remains to be done. For now we are addressing first of all those topics that we have considered to be most useful for the university community, such as scientific production and transfer, and academic information of the UEX. But in the near future we want to develop datasets and applications related to economic issues (such as public contracts, evolution of expenditure, hiring tables) and administrative issues (such as the teaching organization plan, organization chart of Services, compositions of governing bodies, etc.) to improve the transparency of the institution.

Interview

Google is a company with a strong commitment to open data. It has launched Google Dataset Search, to locate open data in existing repositories around the world, and also offers its own datasets in open format as part of its Google Research initiative. In addition, it is a reuser of open data in solutions such as Google Earth.

Among its areas of work is Google for Education, with solutions designed for teachers and students. At datos.gob.es we have interviewed Gonzalo Romero, director of Google for Education in Spain and member of the jury in charge of evaluating the proposals received in the III edition of Desafío Aporta. Gonzalo talked to us about his experience, the influence of open data on the education sector and the importance of opening data in companies.

Full interview:

1. What challenges does the education sector face in Spain and how can open data and data-driven technologies help to overcome them?

Last year, due to the pandemic, the education sector was forced to accelerate its digitalization process so that the activity could develop as normally as possible.

The main challenges facing the education sector in Spain are technology and digitization, as this sector is less digitized than average. Secure, simple and sustainable digital tools are needed so that the education system, from teachers and students to administrators, can operate easily and without problems.

Open data makes it possible to locate certain quality information from thousands of sources quickly and easily at any time. These repositories create a reliable data sharing ecosystem that encourages publishers to publish data to drive student learning and the development of technology solutions.

2. Which datasets are most in demand for implementing educational solutions?

Each region usually generates its own datasets. The main challenge is how to create new datasets collaboratively, with the variables needed to build predictive models that anticipate the main challenges the sector faces, such as school dropout, personalization of learning or academic and professional guidance, among others.

3. How can initiatives such as hackathons or challenges help drive data-driven innovation? How was your experience in the III Aporta Challenge?

It is essential to support projects and initiatives that develop innovative solutions to promote the use of data.

Technology offers tools that help to find synergies between public and private data to develop technological solutions and promote different skills among students.

4. In addition to being the basis for technological solutions, open data also plays an important role as an educational resource in its own right, as it can provide knowledge in multiple areas. To what extent does this type of resource foster critical thinking in students?

The use of open data in the classroom is a way to boost and foster students' educational skills. To make good use of these resources, it is important to search for and filter the information according to the needs at hand, and to improve the ability to analyse data and argue in a reasoned way. In addition, it allows students to learn how to handle technological programs and tools.

These skills are useful for the future not only academically but also in the labour market, since more and more professionals with skills related to analytical capacity and data management are in demand.

5. Through your Google Research initiative, multiple projects are being carried out, some of them linked to the opening and reuse of open data. Why is it important that private companies also open data?

We understand the difficulties private companies may have in sharing data, since sharing their information can give competitors an advantage. However, it is essential to combine public and private sector data to drive the growth of the open data market, which can lead to new analyses and studies and the development of new products and services.

It is also important to approach data reuse in the light of new and emerging social challenges and to facilitate the development of solutions without having to start from scratch.

6.What are Google's future plans for open data?

Sensitive corporate data has high survivability requirements, in case a provider has to cancel cloud services due to policy changes in a country or region, and we believe it is not possible to secure data with a proprietary solution. However, we do have open source and open standards tools that address multiple customer concerns.

Data analysis tools such as BigQuery or BigQuery Omni allow customers to make their own data more open, both inside and outside their organization. The potential of that data can then be harnessed in a secure and cost-efficient way. We already have clear use cases of value created with our data and artificial intelligence technology, and endorsed by the CDTI, such as the Student Success data dropout prevention model. Leading educational institutions already use it on a daily basis and it is in pilot phase in some education departments.

The company's goal is to continue working to build an open cloud hand in hand with our local partners and public institutions in Spain and across Europe, creating a secure European digital data ecosystem with the best technology.

News

How time flies! It seems like only yesterday when, at this time of year, we were writing our letter to Santa Claus and the Three Wise Men asking them for our most sincere wishes. Once again, Christmas is here to remind us of the magic of reuniting with our loved ones, but it is also the perfect time to enjoy and rest.

For many, within that Christmas happiness is the passion for reading. What better time to enjoy a good book than these winter days under the warmth of home? Novels, comics, essays... but also guides or theoretical books that can help you expand your knowledge related to your field of work. Therefore, as every year, we have asked our pages -the collaborators of datos.gob.es- the best recommendations on books related to data and technology to offer you some ideas that you can include in your letter this year, if you have behaved well.

Telling your Data Story: Data Storytelling for Data Management, Scott Taylor (The Data Whisperer)

What is it about? The author of this book offers us a practical guide to explain and make us understand the strategic value of data management within the business environment.

Who is it for? It is focused on working professionals interested in improving their skills both in managing and carrying out a company's data plan, as well as in handling tools to be able to clearly explain their actions to third parties. Its target audience also includes data scientists interested in bringing this discipline closer to the business sector.

Language: English

The art of statistics: Learning from data, David Spiegelhalter

What is it about? This book shows readers how to derive knowledge from raw data and mathematical concepts. Through real-world examples, the author tells us how data and statistics can help us solve different problems, such as determining the luckiest passenger on the Titanic or whether a serial killer could have been caught earlier, among others.

Who is it for? If you are passionate about the world of statistics and curious data, this book is for you. Its readability and full of examples taken from the world around us makes this book an interesting idea to include in your letter to the Three Wise Men this year.

Language: English

Big Data. Conceptos, tecnologías y aplicaciones, David Ríos Insúa y David Gómez Ullate Oteiza

What is it about? This new CSIC book brings readers closer to big data and its applications in politics, healthcare and cybersecurity. Its authors, David Ríos and David Gómez-Ullate describe the technology and methods used by data science, explaining its potential in various fields.

Who is it for? Anyone interested in expanding their knowledge of current scientific and technological issues will find this book more than interesting. Its simple and accessible explanations make this manual a pleasant and friendly text for all types of readers.

Language: Spanish

Data Analytics with R: A Recipe book, Ryan Garnett

What is it about? As if it were a recipe book! This is how Ryan Garnett presents this book, dedicated to explaining data analysis with the R language to readers in an entertaining and very practical way.

Who is it for? This book is a very interesting option for both programmers and data analysis enthusiasts who want to discover more about R. Its structure in the form of recipes to explain this field makes it easy to understand. In addition, you can download it for free.

Language: English

Datanomics, Paloma Llaneza

What is it about? This book reveals with data, reports and proven facts what technology companies are really doing with the personal data we give them and how they profit from it.

Who is it for? It is a document of great interest to all citizens. The fact that the information it contains is reinforced with supporting reports makes it lighter and more enjoyable to read.

Language: Spanish

Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us about Who We Really Are, Seth Stephens-Davidowitz

What is it about? Did you know that users surfing the Internet expose a total of 8 billion gigabytes of data every day? The author shows us in this book how the data we provide about our searches reveal our fears, desires and behaviors, but also conscious and unconscious biases.

Who is it for? This book is aimed at anyone looking to expand their knowledge about how we express ourselves in the digital age. If you would like answers to questions such as whether where we go to school can influence our future success, this book is for you.

Language: English

As in previous years, this list is just a small selection that we have made based on the recommendations suggested by some of the experts who collaborate in datos.gob.es, as well as some of the members of this team. Undoubtedly, the world is full of really interesting books on data and technology.

Do you know of any more that we should not forget in our letter to the Three Wise Men? Leave us a comment or send us an email to dinamizacion@datos.gob.es, we will be happy to read it!

Interview

Open data is not only a matter for public administrations; more and more companies are also betting on it. This is the case of Microsoft, which has provided access in Azure to selected open data designed for training Machine Learning models. It also collaborates in the development of multiple projects to promote open data. In Spain, it has collaborated in the development of the HealthData 29 platform, intended for the publication of open data to promote medical research.

We have interviewed Belén Gancedo, Director of Education at Microsoft Ibérica and a member of the jury in the III edition of the Aporta Challenge, which focused on the value of data for the education sector. We met with her to talk about the importance of digital education and innovative data-driven solutions, as well as the importance of open data in the business sector.

Complete interview:

1. What challenges in the education sector, to which it is urgent to respond, has the pandemic in Spain revealed?

Technology has become an essential element in the new way of learning and teaching. Over the last few months, marked by the pandemic, we have seen how education has shifted to a hybrid model, face-to-face and remote, in a very short time. We have seen examples of schools that, in record time, in less than two weeks, have had to accelerate the digitization plans they already had in mind.

Technology has gone from being a temporary lifeline, enabling classes to be taught in the worst stage of the pandemic, to becoming a fully integrated part of the teaching methodology of many schools. According to a recent YouGov survey commissioned by Microsoft, 71% of elementary and middle school educators say that technology has helped them improve their methodology and improved their ability to teach. In addition, 82% of teachers report that the pace at which technology has driven innovation in teaching and learning has accelerated in the past year.

Before this pandemic, in some way, those of us who had been dedicating ourselves to education were the ones who defended the need to digitally transform the sector and the benefits that technology brought to it. However, the experience has served to make everyone aware of the benefits of the application of technology in the educational environment. In that sense, there has been an enormous advance. We have seen a huge increase in the use of our Teams tool, which is already used by more than 200 million students, teachers, and education staff around the world.

The biggest challenges at present, then, are not only to take advantage of data and Artificial Intelligence to provide more personalized experiences and operate with greater agility, but also to integrate technology with pedagogy, which will allow more flexible, attractive and inclusive learning experiences. Students are increasingly diverse, and so are their expectations about the role of college education in their journey to employment.

The biggest challenges at present, then, are not only to take advantage of data and Artificial Intelligence to provide more personalized experiences and operate with greater agility, but also to integrate technology with pedagogy, which will allow more flexible, attractive and inclusive learning experiences.

2. How can open data help drive these improvements? What technologies need to be implemented to drive improvements in the efficiency and effectiveness of the learning system?

Data is present in all aspects of our lives. Even without going into the mathematics or the algorithms that govern predictive analytics, its impact can be seen in education in the detection of learning difficulties before it is too late. This can help teachers and institutions gain a greater understanding of their students and insight into how to help solve their problems.

Predictive analytics platforms and Artificial Intelligence technology have already been used with very positive results by different industries to understand user behavior and improve decision-making. With the right data, the same can be applied in classrooms. On the one hand, it helps to personalize and drive better learning outcomes, creating inclusive and personalized learning experiences so that each student is empowered to succeed. If implemented correctly, it allows better and closer monitoring of the needs of students, who become the center of learning and enjoy permanent support.

At Microsoft we want to be the ideal travel companion for the digital transformation of the education sector. We offer educational entities the best solutions - cloud and hardware - to prepare students for their professional future, in a complete environment of collaboration and communication for the classroom, both in face-to-face and online models. Solutions like Office 365 Education and the Surface device are designed precisely to drive collaboration both inside and outside the classroom. The educational version of Microsoft Teams makes a virtual classroom possible: it is a free tool for schools and universities that integrates conversations, video calls, content, assignments and applications in one place, allowing teachers to create lively learning environments that are accessible from mobile devices.

In addition, we make available to schools, teachers and students devices specifically designed for the educational environment, such as the Surface Go 2. It is an evolutionary device, that is, it adapts to any educational stage and boosts the creativity of students thanks to its power, versatility and safety. This device allows the mobility of both teachers and students inside and outside the classroom, offers connectivity with other peripheral devices (printers, cameras...), and includes the Microsoft Classroom Pen for natural writing and drawing in digital ink.

3. There is increasing demand for digital skills and competencies related to data. In this sense, the National Plan for Digital Skills has been launched, which includes the digitization of education and the development of digital skills for learning. What changes should be made in educational programs in order to promote the acquisition of digital knowledge by students?

Without a doubt, one of the biggest challenges we face today is the lack of training and digital skills. According to a study carried out by Microsoft and EY, 57% of the companies surveyed expect AI to have a high or very high impact in business areas that are "totally unknown to companies today."

There is a clear opportunity for Spain to lead Europe in digital talent, consolidating itself as one of the most attractive countries to attract and retain this talent. A recent LinkedIn study anticipates that two million technology-related jobs will be created in Spain in the next five years, not only in the technology industry but also, and above all, in companies in other sectors of activity that seek to incorporate the talent needed to carry out their transformation. However, there is a shortage of professionals with digital skills and training. According to data from the Digital Economy and Society Index report published annually by the European Commission, Spain is below the European average in most of the indicators relating to the digital skills of Spanish professionals.

There is, therefore, an urgent need to train qualified talent with skills in digital technologies, data management, AI, machine learning... Technology-related profiles are already among the most difficult to find and, in the near future, the most sought-after will be those related to data analytics, cloud computing and application development.

For this, adequate training is necessary, not only in the way of teaching but also in the curricular content. Any degree, not just those in the STEM field, should include subjects related to technology and AI, which will define the future. The use of AI reaches every field, not only technology; therefore students of any kind of degree - Law or Journalism, to give some examples of non-STEM careers - need qualified training in technologies such as AI or data science, since they will have to apply them in their professional future.

We must bet on public-private collaboration, involving the technology industry, public administrations, the educational community - adapting university curricula to the reality of the labor market - and third-sector entities, with the aim of promoting employability and professional reskilling. In this way we can train professionals in areas such as quantum computing, Artificial Intelligence or data analytics, and aspire to digital leadership.


4. Even today we find a disparity between the number of men and women who choose professional branches related to technology. What is needed to promote the role of women in technology?

According to the National Observatory of Telecommunications and the Information Society (ONTSI, July 2020), the digital gender gap has been progressively reduced in Spain, going from 8.1 points to 1 point, although women remain in an unfavorable position in digital skills and Internet use. In advanced skills, such as programming, the gap in Spain is 6.8 points, while the EU average is 8 points. The percentage of women among researchers in the ICT services sector falls to 23.4%. And in terms of the percentage of STEM graduates, Spain ranks 12th within the EU, with a gap between the sexes of 17 points.

Without a doubt, there is still a long way to go. One of the main barriers that women face in the technology sector, and when it comes to entrepreneurship, is stereotypes and cultural tradition. The masculinized environment of technical degrees and the stereotypes about those who work in technology make these careers unattractive to women.

Digitization is boosting the economy and promoting business competitiveness, as well as generating an increase in the creation of specialized employment. Perhaps the most interesting thing about the impact of digitization on the labor market is that these new jobs are not only being created in the technology industry, but also in companies from all sectors, which need to incorporate specialized talent and digital skills.

Therefore, there is an urgent need to train qualified talent with digital capabilities, and this talent must be diverse. Women cannot be left behind. It is time to tackle gender inequality and alert everyone to this enormous opportunity, regardless of their gender. STEM careers are an ideal future option for anyone, regardless of gender.

To foster the presence of women in the technology sector, and in favor of a digital era without exclusion, at Microsoft we have launched different initiatives that seek to banish stereotypes, encourage girls and young women to take an interest in science and technology, and make them see that they too can be protagonists of the digital society. In addition to the WONNOW Awards, which we organize together with CaixaBank, we also participate and collaborate in many initiatives, such as the Ada Byron Awards together with the University of Deusto, to help give visibility to the work of women in the STEM field, so that they serve as role models for those who are yet to come.


5. How can initiatives like hackathons or challenges help drive data-driven innovation? How was your experience in the III Aporta Challenge?

These types of initiatives are key to that much-needed change. At Microsoft we are constantly organizing hackathons on a global, regional and local scale, to innovate in different priority areas for the company, such as education.

But we go further: we also use these tools in class. One of Microsoft's bets is the Hacking STEM projects. These are projects that mix the "maker" concept of learning by doing with programming and robotics, through the use of everyday materials. Moreover, they are made up of activities that allow teachers to guide their students to build and create scientific instruments and project-based tools to visualize data through science, technology, engineering, and mathematics. Our projects - both Hacking STEM and those on coding and computational thinking through free tools such as MakeCode - aim to bring programming and robotics to any subject in a transversal way and, why not, to learn programming in a Latin class or in a biology one.

My experience in the III Aporta Challenge has been fantastic, because it has allowed me to learn about incredible ideas and projects where the usefulness of the wealth of available data becomes a reality and is put at the service of improving education for everyone. There was a lot of participation and, in addition, very careful and well-crafted presentations. I would like to take this opportunity to thank everyone who participated and also to congratulate the winners.

6. A year ago, Microsoft launched a campaign to promote open data in order to close the gap between countries and companies that have the necessary data to innovate and those that do not. What has the project consisted of? What progress has been made?

Microsoft's global Open Data Campaign initiative seeks to help close the growing "data gap" between the small number of technology companies that benefit most from the data economy today and the other organizations that are hampered by a lack of access to data or a lack of capabilities to use the data they already have.

Microsoft believes that more needs to be done to help organizations share and collaborate around data so that businesses and governments can use it to meet the challenges they face, as the ability to share data has huge benefits. And not only for the business environment: data also plays a critical role in helping us understand and address major challenges such as climate change, or health crises such as the COVID-19 pandemic. To take full advantage of data, it is necessary to develop the ability to share it in a safe and reliable way, and to allow it to be used effectively.

Within the Open Data Campaign initiative, Microsoft has announced five principles that guide how the company itself shares its data with others:

  • Open - Microsoft will work to make relevant data on large social issues as open as possible.
  • Usable - Microsoft will invest in creating new technologies and tools, governance mechanisms and policies so that data can be used by everyone.
  • Empowering - Microsoft will help organizations generate value from their data and develop AI talent to use it effectively.
  • Secure - Microsoft will employ security controls to ensure data collaboration is secure at the operational level.
  • Private - Microsoft will help organizations protect the privacy of individuals in data-sharing collaborations that involve personally identifiable information.

We continue to make progress in this regard. Last year, Microsoft Spain, together with Foundation 29, the Microsoft-Universitat de València Chair on Privacy and Digital Transformation, and with the legal advice of the law firm J&A Garrigues, created the "Health Data" Guide, which describes the technical and legal framework for creating a public repository of health system data so that it can be shared and used in research environments. LaLiga is one of the entities that has shared its anonymized data, in June of this year.

Data is the beginning of everything, and one of our biggest responsibilities as a technology company is to help preserve the ecosystem on a large scale, at a planetary level. For this, the greatest challenge is to consolidate not only all the available data, but also the artificial intelligence algorithms that provide access to it and allow decisions to be made and predictive models and scenarios to be created with up-to-date information from multiple sources. For this reason, Microsoft launched the concept of the Planetary Computer, based on Open Data, to make more than 10 Petabytes of data - and growing - available free of charge to scientists, biologists, startups and companies, from multiple sources (biodiversity, electrification, forestry, biomass, satellite), along with APIs, development environments and applications (predictive models, etc.) to create a greater impact for the planet.


7. You also offer some open datasets through the Azure Open Datasets initiative. What kind of data do you offer? How can users make use of them?

This initiative seeks to help companies improve the accuracy of their Machine Learning models' predictions and reduce data preparation time, thanks to curated, publicly accessible datasets that are ready to use and easily accessible from Azure services.

There is data of all kinds - health and genomics, transport, labor and economy, population and safety, common data... - that can be used in multiple ways. And it is also possible to contribute datasets to the community.
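As an illustration of how these datasets can be consumed, below is a minimal sketch in Python, assuming the azureml-opendatasets package is installed; the NYC taxi dataset and the date range are arbitrary illustrative choices, and the exact class names vary depending on the dataset chosen.

# Minimal sketch (assumption): load a slice of an Azure Open Dataset into a pandas DataFrame
# using the azureml-opendatasets package. Dataset and date range are illustrative only.
from datetime import datetime
from azureml.opendatasets import NycTlcYellow

taxi = NycTlcYellow(start_date=datetime(2018, 5, 1), end_date=datetime(2018, 5, 7))
df = taxi.to_pandas_dataframe()  # downloads the selected slice as a pandas DataFrame
print(df.head())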

8. What are Microsoft's future plans for open data?

After a year of the Open Data Campaign, we have learned a lot and, in collaboration with our partners, we are going to focus next year on practical aspects that make the process of data sharing easier. We have just started publishing materials so that organizations can see the nuts and bolts of how to start sharing data. We will continue to identify possible collaborations to solve social challenges on issues of sustainability, health, equity and inclusion. We also want to connect those who are working with data, or want to explore that realm, with the opportunities offered by the Microsoft Certifications in Data and Artificial Intelligence. And, above all, this issue requires a good regulatory framework and, for this, it is necessary that those who define the policies meet with the industry.

Documentation

1. Introduction

Data visualization is a task linked to data analysis whose aim is to graphically represent the information underlying the data. Visualizations play a fundamental role in the communicative power of data, since they allow conclusions to be drawn in a visual and understandable way, and also make it possible to detect patterns, trends and anomalous data, or to project predictions, among other functions. This makes their application transversal to any process in which data are involved. The visualization possibilities are very numerous, from basic representations, such as line charts, bar charts or pie charts, to complex visualizations configured as interactive dashboards.

Before we start to build an effective visualization, we must carry out a pre-treatment of the data, paying attention to how to obtain them and validating the content, ensuring that they do not contain errors and are in an adequate and consistent format for processing. Pre-processing of data is essential to start any data analysis task that results in effective visualizations.

A series of practical data visualization exercises based on open data available on the datos.gob.es portal or other similar catalogues will be presented periodically. They will address and describe, in a simple way, the stages necessary to obtain the data, perform the relevant transformations and analyses, and create interactive visualizations from which to summarize as much information as possible in the final conclusions. In each of the exercises, simple and adequately documented code developments will be used, as well as free and open-use tools. All the material generated will be available for reuse in the Data Lab repository on Github.

 


Visualization of the teaching staff of Castilla y León classified by Province, Locality and Teaching Specialty 

2. Objectives

The main objective of this post is to learn how to handle a dataset, from its download to the creation of one or more interactive graphs. For this purpose, datasets containing relevant information on teachers and students enrolled in public schools in Castilla y León during the 2019-2020 academic year have been used. Based on these data, we have analysed several indicators that relate teachers, specialties and students enrolled in the schools of each province or locality of the autonomous community.

3. Resources

3.1. Datasets

For this study, datasets on Education published by the Junta de Castilla y León have been selected, available on the open data portal datos.gob.es. Specifically, the dataset on the staffing plan of public schools in Castilla y León for all teaching bodies (except primary school teachers) and the dataset on students enrolled in public schools in Castilla y León, broken down by municipality.

3.2. Tools

To carry out this analysis (setting up the work environment, programming and writing), the Python programming language (version 3.7) and JupyterLab (version 2.2) have been used. Both tools are included in Anaconda, one of the most popular platforms for installing, updating and managing software for Data Science work. All these tools are open and available for free.

 

JupyterLab is a web-based user interface that provides an interactive development environment in which users can work with so-called Jupyter notebooks, which make it easy to integrate and share text, source code and data.

To create the interactive visualization, the Kibana tool (version 7.10) has been used.

Kibana is an open source application that is part of the Elastic Stack product suite (Elasticsearch, Logstash, Beats and Kibana) and provides visualization and exploration capabilities for data indexed in the Elasticsearch analytics engine.

If you want to know more about these tools or others that can help you in the processing and visualization of data, you can consult the recently updated "Data Processing and Visualization Tools" report.

4. Data processing

As a first step of the process, it is necessary to perform an exploratory data analysis (EDA) to properly interpret the starting data, detect anomalies, missing data or errors that could affect the quality of subsequent processes and results. Pre-processing of data is essential to ensure that analyses or visualizations subsequently created from it are consistent and reliable. 

Due to the informative nature of this post, and to make it easier for non-specialized readers to follow, the code is not intended to be the most efficient but the easiest to understand. You will probably come up with many ways to optimize the proposed code to obtain similar results, and we encourage you to do so! You can reproduce this analysis, since the source code is available in our Github account. The code is provided as a notebook created in JupyterLab which, once loaded into the development environment, can easily be executed or modified.

4.1. Installation and loading of libraries

The first thing we must do is import the library that we will use to pre-process the data. There are many libraries available in Python, but one of the most popular and best suited for working with these datasets is Pandas, a library for manipulating and analyzing tabular data.

import pandas as pd

4.2. Loading datasets

First, we download the datasets from the open data catalog datos.gob.es and load them into our development environment as tables to explore them and perform some basic data cleaning and processing tasks. To load the data we use the read_csv() function, indicating the download URL of the dataset, the delimiter (";" in this case) and the "encoding" parameter, which we set to "latin-1" so that special characters such as accented letters or "ñ" present in the text strings of the dataset are interpreted correctly.

# Load the dataset of the staffing plans of Castilla y León public schools for all teaching bodies, except primary school teachers
url = "https://datosabiertos.jcyl.es/web/jcyl/risp/es/educacion/plantillas-centros-educativos/1284922684978.csv"
docentes = pd.read_csv(url, delimiter=";", header=0, encoding="latin-1")
docentes.head(3)

# Load the dataset of students enrolled in Castilla y León public schools
alumnos = pd.read_csv("matriculaciones.csv", delimiter=",", names=["Municipio", "Matriculaciones"], encoding="latin-1")
alumnos.head(3)

The column "Municipio" of the "alumnos" table contains both the municipality code and its name. We must split this column in two so that it can be handled more easily.

columnas_Municipios = alumnos["Municipio"].str.split(" ", n=1, expand=True)
alumnos["Codigo_Municipio"] = columnas_Municipios[0]
alumnos["Nombre_Municipio"] = columnas_Municipios[1]
alumnos.head(3)

4.3. Creating a new table

Once we have both tables with the variables of interest, we create a new table resulting from their union. The join variables will be "Localidad" in the "docentes" table and "Nombre_Municipio" in the "alumnos" table.

docentes_alumnos = pd.merge(docentes, alumnos, left_on="Localidad", right_on="Nombre_Municipio")
docentes_alumnos.head(3)

4.4. Exploring the dataset

Once we have the table that interests us, we should spend some time exploring the data and interpreting each variable. In such cases it is very useful to have the data dictionary that usually accompanies each downloaded dataset in order to know all its details, but this time we do not have this essential tool. Observing the table, in addition to interpreting the variables that make it up (data types, units, ranges of values), we can detect possible errors, such as incorrectly typed variables or the presence of missing values (NAs), that could reduce the analysis capacity.

docentes_alumnos.info()

In the output of this section of code, we can see the main characteristics of the table: 

  • It contains a total of 4,512 records.
  • It is composed of 13 variables: 5 numerical variables (integer type) and 8 categorical variables ("object" type).
  • There are no missing values.

Once we know the structure and content of the table, we must rectify errors, such as the transformation of some variables that are not properly typed, for example, the variable that holds the school code ("Codigo_centro").

docentes_alumnos.Codigo_centro = docentes_alumnos.Codigo_centro.astype("object")
docentes_alumnos.Codigo_cuerpo = docentes_alumnos.Codigo_cuerpo.astype("object")
docentes_alumnos.Codigo_especialidad = docentes_alumnos.Codigo_especialidad.astype("object")

Once we have the table free of errors, we obtain a description of the numerical variables, "Plantilla" and "Matriculaciones", which will help us to learn important details. In the output of the code presented below we can see the mean, the standard deviation, and the maximum and minimum values, among other statistical descriptors.

docentes_alumnos.describe()
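As a complement to these descriptive statistics, a simple indicator can already be derived from this table. The following is a minimal sketch, assuming the merged table keeps the columns "Localidad", "Plantilla" and "Matriculaciones" used above; the indicator name and the choice of aggregations are illustrative, not part of the original analysis.

# Illustrative indicator: enrolled students per teaching post in each locality.
# "Plantilla" is summed over all specialties of a locality, while "Matriculaciones"
# is repeated on every row of the same locality after the merge, so we take its maximum.
indicador = (
    docentes_alumnos
    .groupby("Localidad")
    .agg(Plantilla_total=("Plantilla", "sum"),
         Matriculaciones=("Matriculaciones", "max"))
)
indicador["Alumnos_por_docente"] = indicador["Matriculaciones"] / indicador["Plantilla_total"]
indicador.sort_values("Alumnos_por_docente", ascending=False).head(10)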

4.5. Save the dataset

Once we have the table free of errors and with the variables that we are interested in graphing, we will save it in a folder of our choice to use it later in other analysis or visualization tools. We will save it in CSV format encoded as UTF-8 (Unicode Transformation Format) so that special characters are correctly identified by any tool we might use later. 

df = pd.DataFrame(docentes_alumnos)
filename = "docentes_alumnos.csv"
df.to_csv(filename, index=False, encoding="utf-8")

5. Creation of the visualization on the teachers of the public educational centers of Castilla y León using the Kibana tool

For the realization of this visualization, we have used the Kibana tool in our local environment. To do this it is necessary to have Elasticsearch and Kibana installed and running. The company Elastic makes all the information about the download and installation available in this tutorial. 

Attached below are two video tutorials, which show the process of creating the visualization and the interaction with the generated dashboard.

In this first video, you can see the creation of the dashboard by generating different graphic representations, following these steps:  

  1. We load the table of previously processed data into Elasticsearch and generate an index that allows us to interact with the data from Kibana. This index enables search and management of the data practically in real time (a code sketch of one programmatic way to do this is included after this list).
  2. Generation of the following graphical representations:
    • Pie chart showing the teaching staff by province, locality and specialty.
    • Metric of the number of teachers by province.
    • Bar chart showing the number of enrolments by province.
    • Filter by province, locality and teaching specialty.
  3. Construction of the dashboard.
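As a reference for step 1, the processed CSV can also be indexed into Elasticsearch programmatically instead of through the Kibana interface. The following is a minimal sketch, assuming a local Elasticsearch instance listening on http://localhost:9200 and the official elasticsearch Python client; the index name docentes_alumnos is an arbitrary choice.

# Minimal sketch (assumption): bulk-index the processed CSV into a local Elasticsearch
# instance so that it can be explored from Kibana. URL and index name are illustrative.
import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
df = pd.read_csv("docentes_alumnos.csv", encoding="utf-8")

actions = (
    {"_index": "docentes_alumnos", "_source": row.to_dict()}
    for _, row in df.iterrows()
)
helpers.bulk(es, actions)  # indexes every row of the table as a document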

 

In this second video, you will be able to observe the interaction with the dashboard generated previously.  

6. Conclusions

Observing the visualization of the data on the number of teachers in public schools in Castilla y León, in the academic year 2019-2020, the following conclusions can be obtained, among others: 

  • The province of Valladolid is the one with both the largest number of teachers and the largest number of students enrolled, while Soria is the province with the lowest number of teachers and the lowest number of students enrolled.
  • As expected, the localities with the highest number of teachers are the provincial capitals.
  • In all provinces, the specialty with the highest number of students is English, followed by Spanish Language and Literature and Mathematics.
  • It is striking that the province of Zamora, although it has a low number of enrolled students, is in fifth position in the number of teachers. 

This simple visualization has helped us to synthesize a large amount of information and to obtain a series of conclusions at a glance, and if necessary, make decisions based on the results obtained. We hope you have found this new post useful and we will return to show you new reuses of open data. See you soon! 
