When we think of open data our first intuition is usually directed towards data generated by public sector bodies in the exercise of their functions and made available for reuse by citizens and businesses, i.e. public sector open data or open public data. This is natural, because public sector information represents an extraordinary source of data and the intelligent use of this data, including its processing through artificial intelligence applications, has great transformative potential in all sectors of the economy, as recognised by the European directive on open data and re-use of public sector information.
One of the most interesting novelties introduced by the directive was the initial but expandable definition of six thematic categories of high-value datasets, whose re-use is associated with considerable benefits for society, the environment and the economy. These six areas - Geospatial, Earth Observation and Environment, Meteorology, Statistics, Companies and Company Ownership, and Mobility - are the ones that in 2019 were considered to have the greatest potential for the creation of value-added services and applications based on such datasets. However, seen from 2021, almost a year into the global health crisis, it seems clear that this list misses two key areas with a high potential impact on society, namely health and education.
Indeed, we find that on the one hand, educational institutions are explicitly exempted from some obligations in the directive, and on the other hand, health sector data are hardly mentioned at all. The directive, therefore, does not provide for a development of these two areas that the circumstances of the covid-19 pandemic have brought to the forefront of society's priorities.
The availability of health and education data
Although health systems, both public and private, generate and store an enormous amount of valuable data in people's medical records, the availability of these data is very limited due to the very high complexity of processing them in a secure way. Health-related datasets are usually only available to the entity that generates them, despite the great value that their release could have for the advancement of scientific research.
The same could be said for data generated by student interaction with educational platforms, which is also generally not available as open data. As in the health sector, these datasets are usually only available to their owners, for whom they are a valuable asset for the improvement of the platforms, which is only a small part of their potential value to society.
The directive states that high-value data should be published in open formats that can be freely used, re-used and shared by anyone for any purpose. Furthermore, in order to ensure maximum impact and facilitate re-use, high-value datasets should be made available for re-use with very few legal restrictions and at no cost.
Health data are highly sensitive from a privacy standpoint, so the delicate trade-off between respect for privacy and the need to support the advancement of scientific research must always be kept in mind. If health and education data were treated as high-value open data, they should probably remain subject to some particular restrictions due to their nature and sensitivity, and mechanisms such as the donation of data for research purposes by patients, or its exchange for the same purpose between researchers, should be promoted. In this sense, the 2018 regulation on data protection introduced the possibility of reusing data for research purposes, provided that the appropriate pseudonymisation measures and the rest of the legally stipulated guarantees are adopted.
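To make the idea of pseudonymisation more concrete, the following is a minimal sketch in Python, not a description of how any of the platforms mentioned here actually work. It assumes that records carry a direct identifier in a field called `patient_id` (a hypothetical name) and replaces it with a keyed hash, so that records belonging to the same person can still be linked without exposing the original identifier. The function names and the sample record are also illustrative assumptions. A step like this would be only one of the guarantees required, since other fields in a record may still allow re-identification.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest that stands in for the real identifier.

    The same identifier and key always give the same pseudonym, so records
    can still be linked. Without the key, the pseudonym cannot be recomputed
    from a list of known identifiers.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_records(records, secret_key, id_field="patient_id"):
    """Return copies of the records with the identifier field replaced.

    `id_field` is a hypothetical column name; the input records are left
    untouched.
    """
    result = []
    for record in records:
        copy = dict(record)
        copy[id_field] = pseudonymise(str(record[id_field]), secret_key)
        result.append(copy)
    return result


if __name__ == "__main__":
    # Illustrative data only; the key would be kept separately by the data holder.
    key = b"replace-with-a-long-random-secret"
    sample = [{"patient_id": "P-001", "diagnosis": "example"}]
    print(pseudonymise_records(sample, key))
```

A keyed hash is used here rather than a plain hash because identifiers often come from a small, guessable set, and an unkeyed digest could be reversed simply by hashing every candidate value.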
The importance of public-private partnerships
Education and health are two areas where the private sector or public-private partnerships are making exciting strides in converting some of the potential of open data into benefits for society. Open data publishing is not the exclusive preserve of the public sector and there is a long tradition of private-public collaboration, largely channelled through universities. Let us look at some examples:
- There are a number of initiatives such as the pioneering UCI Machine Learning Repository, founded in 1987 as a repository of datasets used by the artificial intelligence community for the empirical analysis of machine learning algorithms. This repository has been cited more than 1000 times, which places it among the most cited resources in the computer science domain. In this and other repositories, also managed by universities or foundations with donations from private companies, we can find open datasets released by companies or created or developed with their active collaboration.
- Large technology companies, no doubt inspired by these initiatives, also maintain open data search engines or repositories such as Google's dataset search engine, AWS's open data registry, or Microsoft Azure's datasets, where datasets related to health or education are increasingly common.
- In terms of data that can contribute to improving education, The Open University, for example, publishes OULAD (Open University Learning Analytics Dataset), an open learning analytics dataset containing data on courses, students and their interactions with the virtual learning environment for seven courses. However, there are very few comparable datasets, even though their joint use in projects would undoubtedly allow further progress in areas such as detecting the risk of students dropping out.
- As far as the health sector is concerned, it is worth highlighting the case of the Spanish platform HealthData 29, developed by Fundación 29, which aims to create the necessary infrastructure to make it possible to securely publish open health datasets so that they are available to the community for research purposes. As part of this infrastructure, Fundación 29 has published the Health Data Playbook, a guide for the creation, within the current technical and legal framework, of a public repository of data from health systems, so that they can be used in medical research. Microsoft has collaborated in the preparation of this guide as a technological partner and Garrigues as a legal partner, and it is aimed at organisations that carry out health research.
At the moment the platform only has available the Covid Data Save Lives (COVIDDSL) dataset published by the HM Hospitales University Hospital Group, composed of clinical data on interactions recorded in the covid-19 treatment process. However, it is an excellent example of the potential that we may be missing out on globally by not collecting and publishing more and better data on patients diagnosed with covid-19 in a systematised way and on a global scale. The creation of predictive models of disease progression in patients, the development of epidemiological models on the spread of the virus, or the extraction of knowledge on the behaviour of the virus for vaccine development are just some of the use cases that would benefit from greater availability of this data.
Education and health are two of the great concerns of all developed societies because they are closely related to the well-being of their citizens. But perhaps we have never been more aware of this than in the last year, and this represents an extraordinary opportunity to drive initiatives that contribute to unlocking more open health and education data. Whether as high-value data or in any other form, these datasets are key to enabling us to react better to future health crises, but also to help us overcome the aftermath of the current one.
Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization.
The contents and points of view reflected in this publication are the sole responsibility of its author.
The new Directive on the opening of data and the reuse of public sector information, which was adopted last June, will replace and improve the old Directive 2003/98/EC on the reuse of public sector information. Among the most significant changes within this new Directive is the objective of specifying a list of high-value datasets among those held by public sector bodies.
The creation of a list like this is a very important milestone because, for the first time in the Directive's 15 years, we will have an explicit and common guide on what are the minimum datasets that should always be available, as well as the conditions for their reuse throughout the European Union - which will include their reuse free of charge, through application programming interfaces (APIs), in a machine-readable format and, where appropriate, with a bulk download option.
The questions we all ask ourselves immediately are: what are the high-value data they refer to? And what are the specific criteria that we should apply when identifying such high-value data?
The Directive defines high-value data as “documents whose reuse is associated with important benefits for society, the environment and the economy, in particular because of their suitability for the creation of value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets”. This definition offers several clues as to how these high-value datasets are expected to be identified through a series of indicators that would include:
- Their potential to generate significant social or environmental benefits.
- Their potential to generate economic benefits and new income.
- Their potential to generate innovative services.
- Their potential to benefit a high number of users, in particular SMEs.
- Their potential to be combined with other datasets.
On the other hand, the Commission opened a consultation process some years ago that has served to gauge public opinion on the priority of the data to be published. There are also several studies and reference entities from which the Commission has drawn inspiration and which have been publishing their own recommendations on datasets of high strategic value, such as:
- The results of the MEPSIR study on the exploitation of the information resources of the European Union.
- The technical annex of the G8 Open Data Charter.
- The subject areas that generate business for the infomediary sector in Spain, according to the analysis of the sector carried out by ONTSI.
- The criteria established by the ISA program of interoperability solutions of the European Commission.
- Standard UNE 178301:2015 on Open Data in Smart Cities.
- The data analyzed by the Open Data Barometer and the Global Open Data Index.
- The datasets proposed for publication by the Spanish Federation of Municipalities and Provinces (FEMP).
In addition, the Directive itself offers a further clue in its annex as to which datasets could finally be selected for their high value, through a series of priority domains that largely coincide with the proposals made by the organisations mentioned above: geospatial, earth observation and environmental, meteorological, statistical, company register and transport data.
It should also be remembered that the data related to some of the aforementioned topics are also regulated by specific sectoral legislation - such as Directive 2007/2/EC on spatial data (INSPIRE), Directive 2003/4/EC on environmental information and Directive 2010/40/EU on transport data - and therefore such legislation should also be taken into account when defining the final scope of application.
However, as the new Directive makes clear, the thematic list is not closed and the specific datasets have not yet been defined. In fact, the European Commission has recently commissioned a new impact study precisely to define in detail, and justify, which datasets should ultimately be designated as “high-value”. There are also critical voices calling for better-defined analysis criteria for deciding what these data will eventually be, and for the whole of society to be involved in the process. Fortunately, both the critics and the Commission agree that the solution is to broaden the debate and establish a series of public and expert consultations - as already reflected in the Directive and in the planned impact study - such as the debate that will take place at the next edition of the Aporta Meeting on December 18 in Madrid, whose motto is precisely “Driving high-value data”.
Therefore, we will still have to wait some time, until all the planned studies and consultations are completed, to finally know in detail which high-value data will be subject to mandatory publication in the European Union, although this will surely happen well before the deadline for transposing the Directive in July 2021.
Content prepared by Carlos Iglesias, Open Data Researcher and Consultant, World Wide Web Foundation.
Contents and points of view expressed in this publication are the exclusive responsibility of its author.
The 9th edition of the Aporta Meeting is already underway. The event will take place on December 18 in Madrid, in the morning (from 9:00 a.m. to 2:30 p.m.), and will focus on high-value data.
High-value public and private data are an extraordinary source of information because of their great impact on citizens. When we talk about high-value data we refer to the categories delimited by Directive (EU) 2019/1024, of June 20, 2019: geospatial, environmental, meteorological, statistical, commercial and mobility. This type of data is a key element for boosting innovative services and generating socio-economic and environmental benefits for the entire population.
The relevance of high-value data and the community's interest in them have made them the central theme of the new edition of the Aporta Meeting. Under the slogan "Driving high-value data", the meeting will address the challenges and opportunities we will have to face in order to take full advantage of the value of this type of data.
The event will be structured around three round tables, each one focused on different actors linked to the data ecosystem: high-value data publishers, accelerators that try to boost their reuse, and companies that generate high-value services and products based on reuse.
- Table 1: Towards the availability of high-value data. The first table will be made up of representatives of the public administrations that generate high-value data. The objective is to analyse which datasets are already available and their potential applications, as well as which ones should be opened to respond to user demand and under what conditions: in machine-readable formats, downloadable through application programming interfaces (APIs) and in bulk, with the necessary granularity and formats, and under the appropriate licenses.
- Table 2: Accelerating the use of high-value data. Table two will be a meeting point for projects aimed at boosting the European data-based entrepreneurship ecosystem. To this end, we have invited representatives of business accelerators and initiatives whose common denominator is helping SMEs and data start-ups overcome the barriers they face in order to succeed in the market.
- Table 3: New technological paradigms and the importance of data for their development. The last table will bring together agents from the reuse sector who will discuss the opportunities offered by the availability of high-value data and the challenges that need to be faced to encourage its use.
The full agenda will be available in the coming weeks. You can follow the event news on social networks, with the hashtag #Aporta2019, and on datos.gob.es.