Examples of uncommon open data repositories

Publication date 25/11/2019

Update date 26/11/2019

Description

Interest in open data goes well beyond the data published by public administrations, libraries, museums and cultural foundations. We invite you to discover it in this post.

Normally, the concept of open data is associated with repositories managed by public administrations, foundations and cultural organizations such as libraries and museums. But open data includes much more and, if we search thoroughly, we can find real gems waiting to be explored. Often these are repositories on specific topics, very useful for professionals working in that field. In other cases, they are general repositories with unusual data sets.

Let's see some examples.

Open data and science

To illustrate topic-specific data repositories, let's focus on two examples from the scientific field:

1) Open data portal of the European Space Agency. On this website we can find a large number of images and data from the different space missions of the European Space Agency (ESA). For example, it hosts most of the satellite images from the Copernicus program, the most ambitious Earth observation program to date, which provides accurate, timely and easily accessible information to improve environmental management, understand and mitigate the effects of climate change and ensure civil security.

Example of an open image under the CC BY-SA 3.0 IGO license from the ESA open data repository, specifically from the Copernicus Earth observation program.

ESA makes available not only images and videos from satellites, but also a large amount of observation data that can be processed to generate our own images or analyses. As an example, the data generated by the Gaia mission - the most ambitious mission to draw a three-dimensional map of our Galaxy - is available for direct download. Browsing the links under the main repository, we can access files in .csv format, several tens of MB in size, ready for analysis.
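As a minimal sketch of what "ready for analysis" can mean for such a .csv file, the Python snippet below converts parallax measurements into approximate distances using the standard relation distance (parsecs) = 1000 / parallax (milliarcseconds). The column names `source_id` and `parallax`, and the sample rows themselves, are assumptions made for illustration; they are not taken from an actual downloaded file, so the real header should be checked before reusing this.

```python
import csv
import io

# Synthetic rows standing in for a downloaded file; all values are invented.
SAMPLE_CSV = """source_id,ra,dec,parallax
1,10.0,-5.0,2.0
2,20.5,15.2,0.5
3,30.1,45.9,
4,40.7,-60.3,-0.1
"""


def distances_parsec(csv_file):
    """Return {source_id: distance in parsecs} for rows with a positive parallax."""
    distances = {}
    for row in csv.DictReader(csv_file):
        raw = (row.get("parallax") or "").strip()
        if not raw:
            continue  # no parallax measurement for this source
        parallax_mas = float(raw)
        if parallax_mas <= 0:
            continue  # the simple inversion is meaningless for non-positive values
        distances[row["source_id"]] = 1000.0 / parallax_mas
    return distances


result = distances_parsec(io.StringIO(SAMPLE_CSV))
print(result)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")`. Inverting the parallax is only a rough estimate when the measurement error is large, so this is a starting point rather than a rigorous distance calculation.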

2) CERN open data portal. CERN is the European laboratory for nuclear research. The place where the World Wide Web was born, it brings together a good part of the best scientific talent in Europe and generates several dozen petabytes of data per year. CERN also has its own website dedicated to open data. The CERN open data site is very user-friendly for non-experts and proposes different ways of approaching the data stored there: we can explore the site by following the Learn, Visualize or Analyze paths. This website is a treasure trove of data, but basic (or not so basic) notions of particle physics are needed to exploit its full potential.

In addition to the core site, CERN offers (advanced) users a GitHub site, so that developers who want to work with open data have a more suitable environment for working with the data programmatically. GitHub sites and other open source repositories foster collaborative user communities around open data.

Very, very diverse data

But in addition to these specific repositories, there are also general-purpose repositories with unusual data sets. We have already spoken on previous occasions about the Kaggle website. Kaggle is an open web platform aimed at data scientists in which challenges are posed (some of them with large cash prizes). This time we turn to Kaggle only to explore its extensive catalog of data (mostly published under one of the Creative Commons licenses).

To cite some varied examples, among the first entries of its catalog we find data sets on the height of the waves on the Australian coast or, for example, a data set with a list of 10,000 women's shoes and their prices, published under the CC BY-NC-SA 4.0 license. This list would not be complete without one of the most popular and widely used data sets today. Every quarter, Stack Overflow, the largest online community for programmers, publishes an extract of its database with the posts, votes, tags and comments that have passed through its platform. Analyzing this data set (published under CC BY-SA 3.0), which is more than 100 GB in size, is probably the most accurate way to measure trends in the popularity and use of existing programming languages.
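To give an idea of how such a trend analysis might start, here is a minimal Python sketch that counts how often each tag appears in a Stack Overflow style posts file. It assumes the extract is an XML file of `row` elements whose `Tags` attribute looks like `<python><pandas>`; that layout, and the tiny sample below, are assumptions made for illustration and should be checked against the actual download. The file is read as a stream, since an extract of this size will not fit comfortably in memory.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample in the assumed layout; a real extract is vastly larger.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" />
  <row Id="3" PostTypeId="1" Tags="&lt;java&gt;" />
  <row Id="4" PostTypeId="1" Tags="&lt;python&gt;" />
</posts>
"""

# Matches each tag name written between angle brackets.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def count_tags(xml_stream):
    """Stream through the file and count occurrences of each tag."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "row":
            # Rows without a Tags attribute (e.g. answers) contribute nothing.
            counts.update(TAG_PATTERN.findall(elem.get("Tags", "")))
            elem.clear()  # drop the row's attributes once processed
    return counts


tag_counts = count_tags(io.BytesIO(SAMPLE_XML))
print(tag_counts.most_common(1))
```

For a real file, `io.BytesIO(SAMPLE_XML)` would be replaced by a file opened in binary mode. Grouping the counts by the creation date of each post would turn this into a trend over time.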

In short, in addition to the existing data sets on mobility, environment, location of basic services in cities or cultural collections, there are much more specific open data repositories for those intrepid users who dare to go in search of less common data. Of course, the future of open data has no borders.

Content prepared by Alejandro Alija, expert in Digital Transformation and innovation.

The contents and points of view expressed in this publication are the sole responsibility of the author.
