Big Data Test Infrastructure: A free environment for public administrations to experiment with open data

Fecha de la noticia: 26-03-2024

Photography of stock of a computer

The   Big Data Test Infrastructure (BDTI) is a tool funded by the European Digital Agenda, which enables public administrations to perform analysis with open data and open source tools in order to drive innovation.

This free-to-use, cloud-based tool was created in 2019 to accelerate digital and social transformation. With this approach and also following the European Open Data Directive, the European Commission concluded that in order to achieve a digital and economic boost, the power of public administrations' data should be harnessed, i.e. its availability, quality and usability should be increased. This is how BDTI was born, with the purpose of encouraging the reuse of this information by providing a free analysis test environment that allows public administrations to prototype solutions in the cloud before implementing them in the production environment of their own facilities.

What tools does BDTI offer?

Big Data Test Infrastructure offers European public administrations a set of standard open source tools for storing, processing and analysing their data. The platform consists of virtual machines, analysis clusters, storage and network facilities. The tools it offers are:

  • Databases: to store data and perform queries on the stored data. The BDTI currently includes a relational database(PostgreSQL), a document-oriented database(MongoDB) and a graph database(Virtuoso).
  • Data lake: for storing large amounts of structured and unstructured data (MinIO). Unstructured raw data can be processed with deployed configurations of other building blocks (BDTI components) and stored in a more structured format within the data lake solution.
  • Development environments: provide the computing capabilities and tools necessary to perform standard data analysis activities on data from external sources, such as data lakes and databases.
    • JupyterLab, an interactive, online development environment for creating Jupyter notebooks, code and data.
    • Rstudio, an integrated development environment for R, a programming language for statistical computing and graphics.
    • KNIME, an open source data integration, reporting and analytics platform with machine learning and data mining components, can be used for the entire data science lifecycle.
    • H2O.ai, an open sourcemachine learning ( ML) and artificial intelligence (AI) platform designed to simplify and accelerate the creation, operation and innovation with ML and AI in any environment.
  • Advanced processing: clusters and tools can also be created to process large volumes of data and perform real-time search operations(Apache Spark, Elasticsearch and Kibana)
  • Display: BDTI also offers data visualisation applications such as Apache Superset, capable of handling petabyte-scale data, or Metabase.
  • Orchestration: for the automation of data-driven processes throughout their lifecycle, from preparing data to making data-driven decisions and taking actions based on those decisions, is offered:
    • Apache Airflow, an open source workflow management platform that allows complex data pipelines to be easily scheduled and executed.

Through these cloud-based tools, public workers in EU countries can create their own pilot projects to demonstrate the value that data can bring to innovation. Once the project is completed, users have the possibility to download the source code and data to continue the work themselves, using environments of their choice. In addition, civil society, academia and the private sector can participate in these pilot projects, as long as there is a public entity involved in the use case.

Success stories

These resources have enabled the creation of various projects in different EU countries. Some examples of use cases can be found on the BDTI website. For example, Eurostat carried out a pilot project using open data from internet job advertisements to map the situation of European labour markets. Other success stories included the optimisation of public procurement by the Norwegian Agency for Digitisation, data sharing efforts by the European Blood Alliance and work to facilitate understanding of the impact of COVID-19 on the city of Florence .

In Spain, BDTI enabled a data mining project atthe  Conselleria de Sanitat de la Comunidad Valenciana. Thanks to BDTI, knowledge could be extracted from the enormous amount of scientific clinical articles, a task that supported clinicians and managers in their clinical practices and daily work.

OVERVIEW OF BDTI SUCCESS STORIES   Conselleria de Sanitat (Generalitat Valenciana): Text Mining   Extract knowledge from the huge quantity ofd scientific clinical articles, spporting clinicians and managers in their clinical practices and day-to-day work.   Norwegian Digital Agency (Digdir): Optimisation   Optimise public procurement in Norway, by gathering and analysing big datasets on transactions in this area.   European Blood Alliance: Data sharing   A ready-to-use virtual enviroment in which data collected though a custom-built website are ingested and anonymized, to be then analysed with advanced data visualization and analytical tools.   City of Florence: Mobility data   Predictive, descriptive and time-series analysis on multiple datasets collected before, aduring and after covid-19 pandemic such are public WiFi sensors, sharing and geo-referenced data of people  movement.   Eurostat, European Centre for Development of Vocational Training National Statistical Institutes: Labour market intelligence   Use online job advertisement data to provide timely information about the EU labour markets, application of Artificial Intelligence, Natural Language Processing and Machine Learning to clean text and extract the relevant data.

Courses, newsletter and other resources

 In addition to publishing use cases, theBig Data Test Infrastructure website offers an free online course to learn how to get the most out of BDTI. This course focuses on a highly practical use case: analysing the financing of green projects and initiatives in polluted regions of the EU, using open data from data.europa.eu and other open sources.

In addition, a monthly  newsletter on the latest BDTI news, best practices and data analytics opportunities for the public sector has recently been launched .

In short, the re-use of public sector data (RISP) is a priority for the European Commission and BDTI(Big Data Test Infrastructure) is one of the tools contributing to its development. If you work in the public administration and you are interested in using BDTI register here.