Blog

In the digital age, technological advancements have transformed the field of medical research. One of the factors contributing to technological development in this area is data, particularly open data. The openness and availability of information obtained from health research provide multiple benefits to the scientific community. Open data in the healthcare sector promotes collaboration among researchers, accelerates the validation process of study results, and ultimately helps save lives.

The significance of this type of data is also evident in the prioritized intention to establish the European Health Data Space (EHDS), the first common EU data space emerging from the European Data Strategy and one of the priorities of the Commission for the 2019-2025 period. As proposed by the European Commission, the EHDS will contribute to promoting better sharing and access to different types of health data, not only to support healthcare delivery but also for health research and policymaking.

However, the handling of this type of data must be appropriate due to the sensitive information it contains. Personal data related to health is considered a special category by the Spanish Data Protection Agency (AEPD), and a personal data breach, especially in the healthcare sector, has a high personal and social impact.

To avoid these risks, medical data can be anonymized, which ensures compliance with regulations and fundamental rights and thereby protects patient privacy. The Basic Anonymization Guide, developed by the AEPD based on the work of the Personal Data Protection Commission of Singapore (PDPC), defines key concepts of an anonymization process, including terms, methodological principles, types of risks, and existing techniques.

Once this process is carried out, medical data can contribute to research on diseases, resulting in improvements in treatment effectiveness and the development of medical assistance technologies. Additionally, open data in the healthcare sector enables scientists to share information, results, and findings quickly and accessibly, thus fostering collaboration and study replicability.

In this regard, various institutions share their anonymized data to contribute to health research and scientific development. One of them is the FISABIO Foundation (Foundation for the Promotion of Health and Biomedical Research of the Valencian Community), which has become a reference in the field of medicine thanks to its commitment to open data sharing. As part of this institution, located in the Valencian Community, there is the FISABIO-CIPF Biomedical Imaging Unit, which is dedicated, among other tasks, to the study and development of advanced medical imaging techniques to improve disease diagnosis and treatment.

This research group has developed different projects on medical image analysis. The outcome of all their work is published under open-source licenses: from the results of their research to the data repositories they use to train artificial intelligence and machine learning models.

To protect sensitive patient data, they have also developed their own techniques for anonymizing and pseudonymizing images and medical reports using a Natural Language Processing (NLP) model, whereby anonymized data can be replaced by synthetic values. Following their technique, facial information from brain MRIs can be erased using open-source deep learning software.
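As a rough illustration of replacing identifying values in a report with synthetic ones, the following minimal sketch uses regular expressions as a stand-in for the NLP model mentioned above. The identifier format (`PAT-` followed by digits), the date format, the salt and the function names are all assumptions made for this example and do not describe the unit's actual tooling.

```python
import hashlib
import re

# Hypothetical patterns: a real system would rely on a trained NLP model,
# not on fixed regular expressions like these.
ID_PATTERN = re.compile(r"\bPAT-\d+\b")
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def pseudonym(value: str, salt: str) -> str:
    """Derive a stable synthetic identifier from a real one."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "SUBJ-" + digest[:8]


def pseudonymize_report(text: str, salt: str) -> str:
    """Replace patient identifiers with pseudonyms and dates with a placeholder."""
    text = ID_PATTERN.sub(lambda m: pseudonym(m.group(0), salt), text)
    return DATE_PATTERN.sub("[DATE]", text)


report = "Patient PAT-1042 scanned on 03/05/2021. PAT-1042 shows no findings."
clean = pseudonymize_report(report, salt="demo-salt")
print(clean)
```

Because the same input always maps to the same synthetic identifier, records from one subject stay linked. This makes the sketch pseudonymization rather than full anonymization.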

BIMCV: Medical Imaging Bank of the Valencian Community

One of the major milestones of the Regional Ministry of Universal Health and Public Health, through the Foundation and the San Juan de Alicante Hospital, is the creation and maintenance of the Medical Imaging Bank of the Valencian Community, BIMCV (Medical Imaging Databank of the Valencia Region in English), a repository of knowledge aimed at achieving "technological advances in medical imaging and providing technological coverage services to support R&D projects," as explained on their website.

BIMCV is hosted on XNAT, a platform that contains open-source images for image-based research and is accessible by prior registration and/or on-demand. Currently, the Medical Imaging Bank of the Valencian Community includes open data from research conducted in various healthcare centers in the region, housing data from over 90,000 subjects collected in more than 150,000 sessions.

New Dataset of Radiological Images

Recently, the FISABIO-CIPF Biomedical Imaging Unit and the Prince Felipe Research Center (FISABIO-CIPF) released in open access the third and final iteration of data from the BIMCV-COVID-19 project. They released image data of chest radiographs taken from patients with and without COVID-19, as well as the models they had trained for the detection of different chest X-ray pathologies, thanks to the support of the Regional Ministry of Innovation, the Regional Ministry of Health and the European Union REACT-EU Funds. All of this was made available "for use by companies in the sector or simply for research purposes," explains María de la Iglesia, director of the unit. "We believe that reproducibility is of great relevance and importance in the healthcare sector," she adds. The datasets and the results of their research can be accessed here.

The findings are mapped using the standard terminology of the Unified Medical Language System (UMLS), as proposed in the doctoral thesis of Dr. Aurelia Bustos, an oncologist and computer engineer. They are stored in high resolution with anatomical labels in a Medical Image Data Structure (MIDS) format. The stored information includes patient demographic data, projection type, and imaging study acquisition parameters, among others, all anonymized.

The contribution that such open data projects make to society not only benefits researchers and healthcare professionals but also enables the development of solutions that can have a significant impact on improving healthcare. One of these solutions can be generative AI, whose results healthcare professionals can take into account when making personalized diagnoses and proposing more effective treatments, while always prioritizing their own judgment.

On the other hand, the digitization of healthcare systems is already a reality, including 3D printing, digital twins applied to medicine, telemedicine consultations, or portable medical devices. In this context, the collaboration and sharing of medical data, provided their protection is ensured, contribute to promoting research and innovation in the sector. In other words, open data initiatives for medical research stimulate technological advancements in healthcare.

Therefore, the FISABIO Foundation, together with the Prince Felipe Research Center, where the platform hosting BIMCV is located, stands out as an exemplary case in promoting the openness and sharing of data in the field of medicine. As the digital age progresses, it is crucial to continue promoting data openness and encouraging its responsible use in medical research, for the benefit of society.

Blog

The Hercules initiative was launched in November 2017, through an agreement between the University of Murcia and the Ministry of Economy, Industry and Competitiveness, with the aim of developing a Research Management System (RMS) based on semantic open data that offers a global view of the research data of the Spanish University System (SUE), to improve management, analysis and possible synergies between universities and the general public.

This initiative is complementary to UniversiDATA, where several Spanish universities collaborate to promote open data in the higher education sector by publishing datasets through standardised and common criteria. Specifically, a Common Core is defined with 42 dataset specifications, of which 12 have been published for version 1.0. Hercules, on the other hand, is a research-specific initiative, structured around three pillars:

  • Innovative RMS prototype
  • Unified knowledge graph (ASIO) [1]
  • Data Enrichment and Semantic Analysis (EDMA)

The ultimate goal is the publication of a unified knowledge graph integrating all research data that participating universities wish to make public. Hercules foresees the integration of universities at different levels, depending on their willingness to replace their RMS with the Hercules RMS. In the case of external RMSs, the degree of accessibility they offer will also have an impact on the volume of data they can share through the unified network.

General organisation chart of the Hercules initiative

General organisation chart of the Hercules initiative. It represents how the Hercules semantic layer can be connected to Hercules-aligned universities and other universities. Hercules-aligned universities can be of three types: 1. Universities willing to change their RMS, which will be connected to the Hercules RMS. 2. Universities not willing to change their RMS, but with an accessible RMS, which will use tightly coupled connectors. 3. Universities not willing to change their RMS, with an inaccessible RMS, which will use loosely coupled connectors. The other universities not aligned with Hercules will use loosely coupled connectors.

The ASIO Project (Semantic Architecture and Ontology Infrastructure) is part of the Hercules initiative. The purpose of this sub-project is to define an Ontology Network for Research Management (Ontology Infrastructure). An ontology is a formal definition that describes a particular domain of discourse with fidelity and high granularity. In this case the domain is research, and the resulting ontology can be extrapolated to other Spanish and international universities (at the moment the pilot is being developed with the University of Murcia). In other words, the aim is to create a common data vocabulary.

Additionally, through the Semantic Data Architecture module, an efficient platform has been developed to store, manage and publish SUE research data, based on ontologies, with the capacity to synchronise instances installed in different universities, as well as the execution of distributed federated queries on key aspects of scientific production, lines of research, search for synergies, etc.

As a solution to this innovation challenge, two complementary lines have been proposed, one centralised (synchronisation at write time) and the other decentralised (synchronisation at query time). The architecture of the decentralised solution is explained in detail in the following sections.

Domain Driven Design

The data model follows the Domain Driven Design approach, modelling common entities and vocabulary, which can be understood by both developers and domain experts. This model is independent of the database, the user interface and the development environment, resulting in a clean software architecture that can adapt to changes in the model.

This is achieved by using Shape Expressions (ShEx), a language for validating and describing RDF datasets, with human-readable syntax. From these expressions, the domain model is automatically generated and allows orchestrating a continuous integration (CI) process, as described in the following figure.

Continuous integration process using Domain Driven Design (only available in Spanish)

Graphic showing the continuous integration process using Domain Driven Design. From Shape Expressions the domain model is generated automatically, in POJO format, which also allows orchestrating a continuous integration (CI) process, integrating and validating the changes as soon as possible.

Using a system with version control as its central element, domain experts can build and visualise multilingual ontologies. These in turn rely on ontologies from the research domain (VIVO, EuroCRIS/CERIF or Research Object) as well as general-purpose ontologies for metadata tagging (Prov-O, DCAT, etc.).

Linked Data Platform

The linked data server is the core of the architecture, in charge of rendering information about all entities. It does this by receiving HTTP requests from the outside and redirecting them to the corresponding services, applying content negotiation, which provides the best representation of a resource based on the client's preferences for different media types, languages, character sets and encodings.
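Content negotiation on media types can be sketched as follows. This is a simplified illustration, not the Hercules server's implementation: it only reads the `q` parameter of the `Accept` header, treats `*/*` as "anything", and ignores partial wildcards such as `text/*`. The function names and the list of available formats are assumptions for the example.

```python
from typing import Optional


def parse_accept(header: str) -> list:
    """Parse an Accept header into (media_type, quality) pairs, best first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0]
        if not media_type:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        entries.append((media_type, quality))
    # sorted() is stable, so entries with equal quality keep header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header: str, available: list) -> Optional[str]:
    """Pick the first acceptable representation the server can produce."""
    for media_type, quality in parse_accept(header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


formats = ["text/html", "text/turtle", "application/ld+json"]
print(negotiate("text/html;q=0.8, text/turtle", formats))
```

A browser that prefers HTML would receive a web page for the same URI, while a client that asks for `text/turtle` would receive RDF.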

All resources are published following a custom-designed persistent URI scheme. Each entity represented by a URI (researcher, project, university, etc.) has a series of actions to consult and update its data, following the patterns proposed by the Linked Data Platform (LDP) and the 5-star model.

This system also ensures compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles and automatically publishes the results of applying these metrics to the data repository.

Open data publication

The data processing system is responsible for the conversion, integration and validation of third-party data, as well as the detection of duplicates, equivalences and relationships between entities. The data comes from various sources, mainly the Hercules unified RMS, but also from alternative RMSs, or from other sources offering data in FECYT/CVN (Standardised Curriculum Vitae), EuroCRIS/CERIF and other possible formats.

The import system converts all these sources to RDF format and registers them in a purpose-built repository for linked data, called a triple store because of its capacity to store subject-predicate-object triples.

Once imported, they are organised into a knowledge graph, easily accessible, allowing advanced searches and inferences to be made, enhanced by the relationships between concepts.
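The idea of storing subject-predicate-object triples and searching them by pattern can be illustrated with a minimal in-memory sketch. The `TripleStore` class and the `ex:` identifiers are hypothetical names invented for this example. A production triple store adds persistence, indexing and SPARQL support.

```python
Triple = tuple[str, str, str]


class TripleStore:
    """Minimal in-memory store of subject-predicate-object triples."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None) -> list:
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(want is None or want == got for want, got in zip(pattern, triple))
        )


store = TripleStore()
store.add("ex:project/1", "rdf:type", "ex:Project")
store.add("ex:project/1", "ex:startDate", "2018-01-15")
store.add("ex:project/2", "rdf:type", "ex:Project")
store.add("ex:researcher/7", "ex:participatesIn", "ex:project/1")

# All entities typed as projects
projects = [s for s, _, _ in store.match(predicate="rdf:type", obj="ex:Project")]
print(projects)
```

Relationships such as `ex:participatesIn` are what turn the stored triples into a graph that can be traversed from one entity to another.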

 

Example of a knowledge network describing the ASIO project


Results and conclusions

The final system not only offers a graphical interface for interactive and visual querying of research data, but also allows users to design SPARQL queries, such as the one shown below. Queries can even be run in a federated way on all nodes of the Hercules network, and results can be displayed dynamically in different types of graphs and maps.

In this example, a query is shown (with limited test data) of all available research projects grouped graphically by year:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Count research projects per start year
SELECT ?year (COUNT(?x) AS ?cuenta)
WHERE {
         ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ldpld1test.um.es/um/es-ES/rec/Project> .
         ?x <https://ldpld1test.um.es/um/es-ES/rec/startDate> ?d .
         # STR() keeps SUBSTR valid if ?d is a typed date literal
         BIND(SUBSTR(STR(?d), 1, 4) AS ?year)
}
GROUP BY ?year
LIMIT 20


Example of the SPARQL query detailed above with its graphical output
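For readers less familiar with SPARQL, the aggregation performed by the query above (take the first four characters of each start date, then count projects per year) can be mirrored in a few lines of Python. The rows below are hypothetical test data, and sorting the result by year is an assumption added for readability: the query itself does not specify an order.

```python
from collections import Counter

# Hypothetical (project URI, start date) rows standing in for query bindings
rows = [
    ("ex:project/1", "2018-01-15"),
    ("ex:project/2", "2018-09-01"),
    ("ex:project/3", "2019-03-20"),
]


def projects_per_year(bindings, limit=20):
    """Group projects by the year prefix of their start date and count them."""
    counts = Counter(start_date[:4] for _, start_date in bindings)
    return sorted(counts.items())[:limit]


result = projects_per_year(rows)
print(result)
```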

In short, ASIO provides a common framework for publishing linked open data that is released as open source and easily adaptable to other domains. For such adaptation, it would be enough to design a specific domain model, including the ontology and the import and validation processes discussed in this article.

Currently the project, in its two variants (centralised and decentralised), is in the process of being put into pre-production within the infrastructure of the University of Murcia, and will soon be publicly accessible.


[1] Graphs are a form of knowledge representation that allow concepts to be related through the integration of data sets, using semantic web techniques. In this way, the context of the data can be better understood, which facilitates the discovery of new knowledge.


Content prepared by Jose Barranquero, expert in Data Science and Quantum Computing.

The contents and views expressed in this publication are the sole responsibility of the author.

Event

Nowadays, research methods (sensors, technological devices, simulations, etc.) generate a large amount of data that, in open and reusable format, holds great re-use potential for other researchers, public administrations, private companies or users. That is why last year the European Commission agreed on an international commitment to promote open science during the Competitiveness Council, ensuring that all results of EU research are available without any technical, legal or financial constraints.

Nevertheless, open access and the dissemination of scientific information involve a series of legal challenges (intellectual property, privacy and personal data protection) that require ad hoc solutions to make the open science movement viable. In this context, the eighth edition of the OpenAIRE workshop will be held on Tuesday 4 April in the framework of the Research Data Alliance (RDA) plenary in Barcelona, dedicated to exploring the legal barriers that hinder open research data and identifying possible solutions.

During the event, OpenAIRE experts will familiarize attendees with the regulatory aspects directly related to open research data and provide specific, pragmatic recommendations so that such barriers do not hinder the openness and re-use of scientific information.

This project, funded by the European Commission, works to encourage and promote open research and optimize access to European scientific data, while offering a network of repositories for free access to knowledge and research results on health, energy, environment, ICT and social sciences. At the same time, OpenAIRE organizes different workshops, such as the event that will take place in the city of Barcelona, to raise awareness about open science, its opportunities and challenges.

Event

 

“Crowd-sourcing questions that if answered could radically increase our understanding of open data”

On October 5th, international researchers will gather at the second Open Data Research Symposium (ODRS), a pre-event to the International Open Data Conference to be held in Madrid. As in the previous edition, ODRS 16 will offer attendees the opportunity to reflect critically on the results of their research while seeking common ground within the research community about the potential impacts of open data.

Though the ODRS call for proposals ended last May, participation has been opened to all members of the open data movement so that they can help shape the program of the event, focusing on the most relevant aspects in the field. To do this, the organization has created a specific section on the Symposium website where users can submit questions about open data for researchers to address. It is also possible to send questions via Twitter using the hashtag #ODSR16. The deadline is July 1st.

Thanks to users' questions, it will be possible to identify the topics of interest to the international open data community, draft the ODRS program to ensure sessions are tailored to the needs of the participants, build a collaborative agenda and report on the efforts and collaborations that take place during the meeting.

More information about the pre-events to the annual open data meeting? Stay tuned to the website of the International Open Data Conference. See you in Madrid!

News

 

 

The Spanish National Research Council (CSIC) and six Spanish universities have worked together to launch Maredata, a national thematic network that stores open research data so that it can be reused by other stakeholders interested in scientific information.

The University of Barcelona is the centre responsible for coordinating this project, in which the following academic institutions take part: the Carlos III University, the University of Alicante, the Open University of Catalonia, the Institute of Agricultural Chemistry and Food Technology (IATA-CSIC), Ingenio and UISYS, CSIC entities attached to the University of Valencia.

Financed by the Ministry of Economy and Competitiveness, the network boosts open science in Spain, guaranteeing unrestricted access to research data from public funding and promoting the participation of the national community in the Pilot Project on Open Research Data sponsored by Horizon 2020. Moreover, two of the goals of this initiative are the development of new research areas and the collaboration between open data and open science communities, making recommendations to help other institutions open their scientific results.

Open science means that the results and methodologies of scientific research, as well as the data obtained from them, can be distributed, reused and accessed for free. Creating new business models, building products or services, developing new lines of research or even validating the quality of science produced in the country are just a few examples of the potential and benefits that research data provide for citizens at all levels.

For these reasons, government bodies want to highlight the value of this information and promote the Maredata thematic network. After all, universities and research centers are public entities and play an essential role in society, so they should be consolidated as open bodies that, thanks to the data stored and produced in different academic fields (teaching, research ...), help increase awareness and public participation.

Aware of the relevance of academic data, several Spanish universities already make their information accessible to society through their own open data sites and publish their action plans. One example is the University of Alicante, which, apart from participating in the Maredata network, has published a book about its open data ecosystem describing its experience in implementing its open data policy.

Thanks to these initiatives, the open university and access to research data are already a reality, which not only supports the sharing and reuse of information but also facilitates decision-making based on academic information and generates opportunities for collaboration among the different actors in society.
