Blog

What is data profiling?

Data profiling is the set of activities and processes aimed at determining the metadata about a particular dataset. This process, considered as an indispensable technique during exploratory data analysis, includes the application of different statistics with the main objective of determining aspects such as the number of null values, the number of distinct values in a column, the types of data and/or the most frequent patterns of data values. Its ultimate goal is to provide a clear and detailed understanding of the structure, content and quality of the data, which is essential prior to its use in any application.

Types of data profiling

There are different alternatives in terms of the statistical principles to be applied during data profiling, as well as their typology. For this article, a review of various approaches by different authors has been carried out. On this basis, it is decided to focus the article on the typology of data profiling techniques on three high-level categories: single-column profiling, multi-column profiling and dependency profiling. For each category, possible techniques and uses are identified, as discussed below.

 

 

More detail on each of the categories and the benefits they bring is presented below.

1. Profiling of a column

Single-column profiling focuses on analysing each column of a dataset individually. This analysis includes the collection of descriptive statistics such as:

  • Count of distinct values, to determine the exact number of unique records in a list and to be able to sort them. For example, in the case of a dataset containing grants awarded by a public body, this task will allow us to know how many different beneficiaries there are for the beneficiaries column, and whether any of them are repeated.

  • Distribution of values (frequency), which refers to the analysis of the frequency with which different values occur within the same column. This can be represented by histograms that divide the values into intervals and show how many values are in each interval. For example, in an age column, we might find that 20 people are between 25-30 years old, 15 people are between 30-35 years old, and so on.

  • Counting null or missing values, which involves counting the number of null or empty values in each column of a dataset. It helps to determine the completeness of the data and can point to potential quality problems. For example, in a column of email addresses, 5 out of 100 records could be empty, indicating 5% missing data.

  • Minimum, maximum and average length of values (for text columns), which is oriented to calculate what is the length of the values in a text column. This is useful for identifying unusual values and for defining length restrictions in databases. For example, in a column of names, we might find that the shortest name is 3 characters, the longest is 20 characters, and the average is 8 characters.

The main benefits of using this data profiling include:

  • Anomaly detection: allows the identification of unusual or out-of-range values.
  • Improved data preparation: assists in normalising and cleaning data prior to use in more advanced analytics or machine learningmodels.

2. Multi-column profiling

Multi-column profiling analyses the relationship between two or more columns within the same data set. This type of profiling may include:

  • Correlation analysis, used to identify relationships between numerical columns in a data set. A common technique is to calculate pairwise correlations between all numerical columns to discover patterns of relationships. For example, in a table of researchers, we might find that age and number of publications are correlated, indicating that as the age of researchers and their category increases, their number of publications tends to increase. A Pearson correlation coefficient could quantify this relationship.

  • Outliers, which involves identifying data that deviate significantly from other data points.  Outliers may indicate errors, natural variability or interesting data points that merit further investigation. For example, in a column of budgets for annual R&D projects, a value of one million euros could be an outlier if most of the income is between 30,000 and 100,000 euros. However, if the amount is represented in relation to the duration of the project, it could be a normal value if the 1 million project has 10 times the duration of the 100,000 euro project.
  • Frequent value combination detection, focused on finding sets of values that occur together frequently in the data. They are used to discover associations between elements, as in transaction data. For example, in a shopping dataset, we might find that the products "nappies" and "baby formula" are frequently purchased together. An association rule algorithm could generate the rule {breads} → {formula milk}, indicating that customers who buy bread also tend to buy butter with a high probability.

The main benefits of using this data profiling include:

  • Trend detection: allows the identification of significant relationships and correlations between columns, which can help in the detection of patterns and trends.
  • Improved data consistency: ensures that there is referential integrity and that, for example, similar data type formats are followed between data across multiple columns.
  • Dimensionality reduction: allows to reduce the number of columns containing redundant or highly correlated data.

3. Profiling of dependencies

Dependency profiling focuses on discovering and validating logical relationships between different columns, such as:

  • Foreign key discovery, which is aimed at establishing which values or combinations of values from one set of columns also appear in the other set of columns, a prerequisite for a foreign key. For example, in the Investigator table, the ProjectID column contains the values [101, 102, 101, 103]. To set ProjectID as a foreign key, we verify that these values are also present in the ProjectID column of the Project table [101, 102, 103]. As all values match, ProjectID in Researcher can be a foreign key referring to ProjectID in Project.
  • Functional dependencies, which establishes relationships in which the value of one column depends on the value of another. It is also used for the validation of specific rules that must be complied with (e.g. a discount value must not exceed the total value).

The main benefits of using this data profiling include:

  • Improved referential integrity: ensures that relationships between tables are valid and maintained correctly.
  • Consistency validation between values: allows to ensure that the data comply with certain constraints or calculations defined by the organisation.
  • Data repository optimisation: allows to improve the structure and design of databases by validating and adjusting dependencies.

Uses of data profiling

The above-mentioned statistics can be used in many areas in organisations. One use case would be in data science and data engineering  initiatives where it allows for a thorough understanding of the characteristics of a dataset prior to analysis or modelling.

  • By generating descriptive statistics, identifying outliers and missing values, uncovering hidden patterns, identifying and correcting problems such as null values, duplicates and inconsistencies, data profiling facilitates data cleaning and data preparation, ensuring data quality and consistency.
  • It is also crucial for the early detection of problems, such as duplicates or errors, and for the validation of assumptions in predictive analytics projects.
  • It is also essential for the integration of data from multiple sources, ensuring consistency and compatibility.
  • In the area of data governance, management and quality, profiling can help establish sound policies and procedures, while in compliance it ensures that data complies with applicable regulations.
  • Finally, in terms of management, it helps optimiseExtract, Transform and Load ( ETL) processes, supports data migration between systems and prepares data sets for machine learning and predictive analytics, improving the effectiveness of data-driven models and decisions.

Difference between data profiling and data quality assessment

This term data profiling is sometimes confused with data quality assessment. While data profiling focuses on discovering and understanding the metadata and characteristics of the data, data quality assessment goes one step further and focuses for example on analysing whether the data meets certain requirements or quality standards predefined in the organisation through business rules. Likewise, data quality assessment involves verifying the quality value for different characteristics or dimensions such as those included in the UNE 0081 specification: accuracy, completeness, consistency or timeliness, etc., and ensuring that the data is suitable for its intended use in the organisation: analytics, artificial intelligence, business intelligence, etc.

Data profiling tools or solutions

Finally, there are several outstanding open source solutions (tools, libraries, or dependencies) for data profiling that facilitate the understanding of the data. These include:

  • Pandas Profiling and YData Profiling offering detailed reporting and advanced visualisations in Python
  • Great Expectations and Dataprep to validate and prepare data, ensuring data integrity throughout the data lifecycle
  • R dtables that allows the generation of detailed reports and visualisations for exploratory analysis and data profiling for the R ecosystem.

In summary, data profiling is an important part of exploratory data analysis that provides a detailed understanding of the structure, contents, etc. and is recommended to be taken into account in data analysis initiatives. It is important to dedicate time to this activity, with the necessary resources and tools, in order to have a better understanding of the data being handled and to be aware that it is one more technique to be used as part of data quality management, and that it can be used as a step prior to data quality assessment


Content elaborated by Dr. Fernando Gualo, Professor at UCLM and Data Governance and Quality Consultant The content and the point of view reflected in this publication are the sole responsibility of its author.

calendar icon
Noticia

The unstoppable advance of ICTs in cities and rural territories, and the social, economic and cultural context that sustains it, requires skills and competences that position us advantageously in new scenarios and environments of territorial innovation. In this context, the Provincial  Council of Badajoz has been able to adapt and anticipate the circumstances, and in 2018 it launched the initiative "Badajoz Es Más - Smart Provincia".

What is "Badajoz Es Más"?

The project "Badajoz Is More" is an initiative carried out by the Provincial Council of Badajoz with the aim of achieving more efficient services, improving the quality of life of its citizens and promoting entrepreneurship and innovation through technology and data governance in a region made up of 135 municipalities. The aim is to digitally transform the territory, favouring the creation of business opportunities, social improvement andsettlement of the population.

Traditionally, "Smart Cities" projects have focused their efforts on cities, renovation of historic centres, etc. However, "Badajoz Es Más" is focused on the transformation of rural areas, smart towns and their citizens, putting the focus on rural challenges such as depopulation of rural municipalities, the digital divide, talent retention or the dispersion of services. The aim is to avoid isolated "silos" and transform these challenges into opportunities by improving information management, through the exploitation of data in a productive and efficient way.

Citizens at the Centre

The "Badajoz es Más" project aims to carry out the digital transformation of the territory by making available to municipalities, companies and citizens the new technologies of IoT, Big Data, Artificial Intelligence, etc. The main lines of the project are set out below.

Provincial Platform for the Intelligent Management of Public Services

It is the core component of the initiative, as it allows for the integration of information from any IoT device, information system or data source in one place for storage, visualisation and in a single place for storage, visualisation and analysis. Specifically, data is collected from a variety of sources: the various sensors of smart solutions deployed in the region, web services and applications, citizen feedback and social networks.

All information is collected on a based on the open source standard FIWARE an initiative promoted by the European Commission that provides the capacity to homogenise data (FIWARE Data Model) and favour its interoperability. Built according to the guidelines set by AENOR (UNE 178104), it has a central module Orion Context Broker (OCB) which allows the entire information life cycleto be managed. In this way, it offers the ability to centrally monitor and manage a scalable set of public services through internal dashboards. 

The  platform is "multi-entity", i.e. it provides information, knowledge and services to both the Provincial Council itself and its associated Municipalities (also known as "Smart Villages"). The visualisation of the different information exploitation models processed at the different levels of the Platform is carried out on different dashboards, which can provide service to a specific municipality or locality only showing its data and services, or also provide a global view of all the services and data at the level of the Provincial Council of Badajoz.

Some of the information collected on the platform is also made available to third parties through various channels:

  • Portal of open dopen data portal. Collected data that can be opened to third parties for reuse is shared through its open data portal. In it we can find information as diverse as real time data on the beaches with blue flags blue flag beaches in the region (air quality, water quality, noise pollution, capacity, etc. are monitored) or traffic flow, which makes it possible to predict traffic jams.
  • Portal for citizens Digital Province Badajoz. This portal offers information on the solutions currently implemented in the province and their data in real time in a user-friendly way, with a simple user experience that allows non-technical people to access the projects developed.

The following graph shows the cycle of information, from its collection, through the platform and distribution to the different channels. All this under strong data governance.

Efficient public services

In addition to the implementation and start-up of the Provincial Platform for the Intelligent Management of Public Services, this project has already integrated various existing services or "verticals" for:

  • To start implementing these new services in the province and to be the example and the "spearhead" of this technological transformation.

  • Show the benefits of the implementation of these technologies in order to disseminate and demonstrate them, with the aim of causing sufficient impact so that other local councils and organisations will gradually join the initiative.

There are currently more than 40 companies sending data to the Provincial Platform, more than 60 integrated data sources, more than 800 connected devices, more than 500 transactions per minute... It should be noted that work is underway to ensure that the new calls for tender include a clause so that data from the various works financed with public money can also be sent to the platform.

The idea is to be able to standardise management, so that the solution that has been implemented in one municipality can also be used in another. This not only improves efficiency, but also makes it possible to compare results between municipalities. You can visualise some of the services already implemented in the Province, as well as their Dashboards built from the Provincial Platform at this video.

Innovation Ecosystem

In order for the initiative to reach its target audience, the Provincial Council of Badajoz has developed an innovation ecosystem that serves as a meeting point for the Badajoz Provincial Council:

  • Citizens, who demand these services.

  • Entrepreneurs and educational entities, which have an interest in these technologies.

  • Companies, which have the capacity to implement these solutions.

  • Public entities, which can implement this type of project.

The aim is to facilitate and provide the necessary tools, knowledge and advice so that the projects that emerge from this meeting can be carried out.

At the core of this ecosystem is a physical innovation centre called the FIWARE Space. FIWARE Space carries out tasks such as the organisation of events for the dissemination of Smart technologies and concepts among companies and citizens, demonstrative and training workshops, Hackathons with universities and study centres, etc. It also has a Showroom for the exhibition of solutions, organises financially endowed Challenges and is present at national and international congresses.

In addition, they carry out mentoring work for companies and other entities. In total, around 40 companies have been mentored by FIWARE Space, launching their own solutions on several occasions on the FIWARE Market, or proposing the generated data models as standards for the entire global ecosystem. These companies are offered a free service to acquire the necessary knowledge to work in a standardised way, generating uniform data for the rest of the region, and to connect their solutions to the platform, helping and advising them on the challenges that may arise.

One of the keys to FIWARE Space is its open nature, having signed many collaboration agreements and agreements with both local and international entities. For example, work on the standardisation of advanced data models for tourism is ongoing with the Future Cities Institute (Argentina). For those who would like more information, you can follow your centre's activity through its weekly blog.

Next steps: convergence with Data Spaces and Gaia-X

As a result of the collaborative and open nature of the project, the Data Space concept fits perfectly with the philosophy of "Badajoz is More". The Badajoz Provincial Council currently has a multitude of verticals with interesting information for sharing (and future exploitation) of data in a reliable, sovereign and secure way. As a Public Entity, comparing and obtaining other sources of data will greatly enrich the project, providing an external view that is essential for its growth. Gaia-X is the proposal for the creation of a data infrastructure for Europe, and it is the standard towards which the "Badajoz es Más" project is currently converging, as a result of its collaboration with the gaia-X Spain hub.

calendar icon
Noticia

Today, 23 April, is World Book Day, an occasion to highlight the importance of reading, writing and the dissemination of knowledge. Active reading promotes the acquisition of skills and critical thinking by bringing us closer to specialised and detailed information on any subject that interests us, including the world of data. 

Therefore, we would like to take this opportunity to showcase some examples of books and manuals regarding data and related technologies that can be found on the web for free. 

1. Fundamentals of Data Science with R, edited by Gema Fernandez-Avilés and José María Montero (2024) 

Access the book here.

  • What is it about? The book guides the reader from the problem statement to the completion of the report containing the solution to the problem. It explains some thirty data science techniques in the fields of modelling, qualitative data analysis, discrimination, supervised and unsupervised machine learning, etc. It includes more than a dozen use cases in sectors as diverse as medicine, journalism, fashion and climate change, among others. All this, with a strong emphasis on ethics and the promotion of reproducibility of analyses. 
  • Who is it aimed at? It is aimed at users who want to get started in data science. It starts with basic questions, such as what is data science, and includes short sections with simple explanations of probability, statistical inference or sampling, for those readers unfamiliar with these issues. It also includes replicable examples for practice. 
  • Language: Spanish.  

2. Telling stories with data, Rohan Alexander (2023). 

Access the book here.

  • What is it about? The book explains a wide range of topics related to statistical communication and data modelling and analysis. It covers the various operations from data collection, cleaning and preparation to the use of statistical models to analyse the data, with particular emphasis on the need to draw conclusions and write about the results obtained. Like the previous book, it also focuses on ethics and reproducibility of results. 
  • Who is it aimed at? It is ideal for students and entry-level users, equipping them with the skills to effectively conduct and communicate a data science exercise. It includes extensive code examples for replication and activities to be carried out as evaluation. 
  • Language: English. 

3. The Big Book of Small Python Projects, Al Sweigart (2021) 

Access the book here.

  • What is it about? It is a collection of simple Python projects to learn how to create digital art, games, animations, numerical tools, etc. through a hands-on approach. Each of its 81 chapters independently explains a simple step-by-step project - limited to a maximum of 256 lines of code. It includes a sample run of the output of each programme, source code and customisation suggestions. 
  • Who is it aimed at?  The book is written for two groups of people. On the one hand, those who have already learned the basics of Python, but are still not sure how to write programs on their own.  On the other hand, those who are new to programming, but are adventurous, enthusiastic and want to learn as they go along. However, the same author has other resources for beginners to learn basic concepts. 
  • Language: English. 

4. Mathematics for Machine Learning, Marc Peter Deisenroth A. Aldo Faisal Cheng Soon Ong (2024) 

Access the book here.

  • What is it about?  Most books on machine learning focus on machine learning algorithms and methodologies, and assume that the reader is proficient in mathematics and statistics. This book foregrounds the mathematical foundations of the basic concepts behind machine learning 
  • Who is it aimed at? The author assumes that the reader has mathematical knowledge commonly learned in high school mathematics and physics subjects, such as derivatives and integrals or geometric vectors. Thereafter, the remaining concepts are explained in detail, but in an academic style, in order to be precise. 
  • Language: English. 

5. Dive into Deep Learning, Aston Zhang, Zack C. Lipton, Mu Li, Alex J. Smola (2021, continually updated) 

Access the book here.

  • What is it about? The authors are Amazon employees who use the mXNet library to teach Deep Learning. It aims to make deep learning accessible, teaching basic concepts, context and code in a practical way through examples and exercises. The book is divided into three parts: introductory concepts, deep learning techniques and advanced topics focusing on real systems and applications. 
  • Who is it aimed at?  This book is aimed at students (undergraduate and postgraduate), engineers and researchers, who are looking for a solid grasp of the practical techniques of deep learning. Each concept is explained from scratch, so no prior knowledge of deep or machine learning is required. However, knowledge of basic mathematics and programming is necessary, including linear algebra, calculus, probability and Python programming. 
  • Language: English. 

6. Artificial intelligence and the public sector: challenges, limits and means, Eduardo Gamero and Francisco L. Lopez (2024) 

Access the book here.

  • What is it about? This book focuses on analysing the challenges and opportunities presented by the use of artificial intelligence in the public sector, especially when used to support decision-making. It begins by explaining what artificial intelligence is and what its applications in the public sector are, and then moves on to its legal framework, the means available for its implementation and aspects linked to organisation and governance. 
  • Who is it aimed at? It is a useful book for all those interested in the subject, but especially for policy makers, public workers and legal practitioners involved in the application of AI in the public sector. 
  • Language: Spanish 

7. A Business Analyst’s Introduction to Business Analytics, Adam Fleischhacker (2024) 

Access the book here.

  • What is it about? The book covers a complete business analytics workflow, including data manipulation, data visualisation, modelling business problems, translating graphical models into code and presenting results to stakeholders. The aim is to learn how to drive change within an organisation through data-driven knowledge, interpretable models and persuasive visualisations. 
  • Who is it aimed at? According to the author, the content is accessible to everyone, including beginners in analytical work. The book does not assume any knowledge of the programming language, but provides an introduction to R, RStudio and the "tidyverse", a series of open source packages for data science. 
  • Language: English. 

We invite you to browse through this selection of books. We would also like to remind you that this is only a list of examples of the possibilities of materials that you can find on the web. Do you know of any other books you would like to recommend? let us know in the comments or email us at dinamizacion@datos.gob.es

calendar icon
Noticia

Between 2 April and 16 May, applications for the call on aid for the digital transformation of strategic productive sectors may be submitted at the electronic headquarters of the Ministry for Digital Transformation and Civil Service. Order TDF/1461/2023, of 29 December, modified by Order TDF/294/2024, regulates grants totalling 150 million euros for the creation of demonstrators and use cases, as part of a more general initiative of Sectoral Data Spaces Program, promoted by the State Secretary for Digitalisation and Artificial Intelligence and framed within the Recovery, Transformation and Resilience Plan (PRTR). The objective is to finance the development of data spaces and the promotion of disruptive innovation in strategic sectors of the economy, in line with the strategic lines set out in the Digital Spain Agenda 2026

Lines, sectors and beneficiaries

The current call includes funding lines for experimental development projects in two complementary areas of action: the creation of demonstration centres (development of technological platforms for data spaces); and the promotion of specific use cases of these spaces. This call is addressed to all sectors except tourism, which has its own call. Beneficiaries may be single entities with their own legal personality, tax domicile in the European Union, and an establishment or branch located in Spain. In the case of the line for demonstration centres, they must also be associative or representative of the value chains of the productive sectors in territorial areas, or with scientific or technological domains. 

Infographic-summary

The following infographics show the key information on this call for proposals:

Infographic "All you need to know: Sectoral Data Space Program"Infographic "Exploring the objective: Sectoral Data Space Program".

 

Would you like more information? 

calendar icon
Noticia

The Centre de documentació i biblioteca del Institut Català d'Arqueologia Clàssica (ICAC) has the repository Open Science ICAC. This website is a space where science is shared in an accessible and inclusive way. The space introduces recommendations and advises on the process of publishing content. Also, on how to make the data generated during the research process available for future research work.

The website, in addition to being a repository of scientific research texts, is also a place to find tools and tips on how to approach the research data management process in each of its phases: before, during and at the time of publication.

  • Before you begin: create a data management plan to ensure that your research proposal is as robust as possible. The Data Management Plan (DMP) is a methodological document that describes the life cycle of the data collected, generated and processed during a research project, a doctoral thesis, etc.
  • During the research process: at this point it points out the need to unify the nomenclature of the documents to be generated before starting to collect files or data, in order to avoid an accumulation of disorganised content that will lead to lost or misplaced data. In addition, this section provides information on directory structure, folder names and file names, the creation of a txt file (README) describing the nomenclatures or the use of short, descriptive names such as project name/acronym, file creation date, sample number or version number. Recommendations on how to structure each of these fields so that they are reusable and easily searchable can also be found on the website.
  • Publication of research data: in addition to the results of the research itself in the form of a thesis, dissertation, paper, etc., it recommends the publication of the data generated by the research process itself. The ICAC itself points out that research data remains valuable after the research project for which it was generated has ended, and that sharing data can open up new avenues of research without future researchers having to recreate and collect identical data. Finally, it outlines how, when and what to consider when publishing research data.

Graphical content for improving the quality of open data

Recently, the ICAC has taken a further step to encourage good practice in the use of open data. To this end, it has developed a series of graphic contents based on the "Practical guide for the improvement of the quality of open data"produced by datos.gob.es. Specifically, the cultural body has produced four easy-to-understand infographics, in Catalan and English, on good practices with open data in working with databases and spreadsheets, texts and docs and CSV format.

All the infographics resulting from the adaptation of the guide are available to the general public and also to the centre's research staff at Recercat, Catalonia's research repository. Soon it will also be available on the Open Science website of the Institut Català d'Arqueologia Clàssica (ICAC)open Science ICAC.

The infographics produced by the ICAC review various aspects. The first ones contain general recommendations to ensure the quality of open data, such as the use of standardised character encoding, such as UTF-8, or naming columns correctly, using only lowercase letters and avoiding spaces, which are replaced by hyphens. Among the recommendations for generating quality data, they also include how to show the presence of null or missing data or how to manage data duplication, so that data collection and processing is centralised in a single system so that, in case of duplication, it can be easily detected and eliminated.

The latter deal with how to set the format of thenumerical figures and other data such as dates, so that they follow the ISO standardised system, as well as how to use dots as decimals. In the case of geographic information, as recommended by the Guide, its materials also include the need to reserve two columns for inserting the longitude and latitude of the geographic points used.

The third theme of these infographics focuses on the development of good databases or spreadsheets databases or spreadsheetsso that they are easily reusable and do not generate problems when working with them. Among the recommendations that stand out are consistency in generating names or codes for each item included in the data collection, as well as developing a help guide for the cells that are coded, so that they are intelligible to those who need to reuse them.

In the section on texts and documents within these databases, the infographics produced by the Institut Català d'Arqueologia Clàssica include some of the most important recommendations for creating texts and ensuring that they are preserved in the best possible way. Among them, it points to the need to save attachments to text documents such as images or spreadsheets separately from the text document. This ensures that the document retains its original quality, such as the resolution of an image, for example.

Finally, the fourth infographic that has been made available contains the most important recommendations for working with CSV format working with CSV format (comma separated value) format, such as creating a CSV document for each table and, in the case of working with a document with several spreadsheets, making them available independently. It also notes in this case that each row in the CSV document has the same number of columns so that they are easily workable and reusable, without the need for further clean-up.

As mentioned above, all infographics follow the recommendations already included in the Practical guide for improving the quality of open data.

The guide to improving open data quality

The "Practical guide for improving the quality of open data" is a document produced by datos.gob.es as part of the Aporta Initiative and published in September 2022. The document provides a compendium of guidelines for action on each of the defining characteristics of quality, driving quality improvement. In turn, this guide takes the data.europe.eu data quality guide, published in 2021 by the Publications Office of the European Union, as a reference and complements it so that both publishers and re-users of data can follow guidelines to ensure the quality of open data.

In summary, the guide aims to be a reference framework for all those involved in both the generation and use of open data so that they have a starting point to ensure the suitability of data both in making it available and in assessing whether a dataset is of sufficient quality to be reused in studies, applications, services or other.

calendar icon
Noticia

At the end of 2023, as reported by datos.gob.es, the ISTAC made public more than 500 semantic assets, including 404 classifications or 100 concept schemes.  

All these resources are available in the Open Data Catalog of the Canary Islands, an environment in which there is room for both semantic and statistical resources and which, therefore, may involve an extra difficulty for a user looking only for semantic assets.  

To facilitate the reuse of these datasets with information so relevant to society, the Canary Islands Statistics Institute, with the collaboration of the Directorate General for the Digital Transformation of Public Services of the Canary Islands Government, published the Bank of Semantic Assets. 

In this portal, the user can perform searches more easily by providing a keyword, identifier, name of the dataset or institution that prepares and maintains it. 

 

The Bank of semantic assets of the Canary Islands Statistics Institute is an application that serves to explore the structural resources used by the ISTAC. In this way it is possible to reuse the semantic assets with which the ISTAC works, since it makes direct use of the eDatos APIs, the infrastructure that supports the Canary Islands statistics institute.  

The number of resources to be consulted increases enormously with respect to the data available in the Catalog, since, on the one hand, it includes the DSD (Data Structures Definitions), with which the final data tables are built; and, on the other hand, because it includes not only the schemes and classifications, but also each of the codes, concepts and elements that compose them.  

This tool is the equivalent of the aforementioned Fusion Metadata Registry used by SDMX, Eurostat or the United Nations; but with a much more practical and accessible approach without losing advanced functionalities. SDMX is the data and metadata sharing standard on which the aforementioned organizations are based. The use of this standard in applications such as ISTAC's makes it possible to homogenize in a simple way all the resources associated with the statistical data to be published. 

The publication of data under the SDMX standard is a more laborious process, as it requires the generation of not only the data but also the publication keys, but in the long run it allows the creation of templates or statistical operations that can be compared with data from another country or region. 

The application recently launched by the ISTAC allows you to navigate through all the structural resources of the ISTAC, including families of classifications or concepts, in an interconnected way, so it operates as a network. 

Functionalities of the Semantic Asset Bank  

The main advantage of this new tool over the aforementioned registries is its ease of use. Which, in this case, is directly measured by how easy it is to find a specific resource.   

 

Thanks to the advanced search, specific resources can be filtered by ID, name, description and maintainer; to which is added the option of including only the results of interest, discriminating both by version and by whether they are recommended by the ISTAC or not.  

In addition, it is designed to be a large interconnected bank, so that, entering a concept, classifications are recommended, or that in a DSD all the representations of the dimensions and attributes are linked. 

 

These features not only differentiate the Semantic Asset Bank from other similar tools, but also represent a step forward in terms of interoperability and transparency by not only offering semantic resources but also their relationships with each other.  

The new ISTAC resource complies with the provisions both at national level with the National Interoperability Scheme (Article 10, semantic assets), and at European level with the European Interoperability Framework (Article 3.4, semantic interoperability). Both documents defend the need and value of using common resources for the exchange of information, a maxim that is being implemented transversally in the Government of the Canary Islands. 

Training Pill  

To disseminate this new search engine for semantic assets, the ISTAC has published a short video explaining the Bank and its features, as well as providing the necessary information about SDMX. In this video it is possible to know, in a simple way and in just a few minutes how to use and get the most out of the new Semantic Assets Bank of the ISTAC through simple and complex searches and how to organize the data to respond to a previous analysis. 

 

In summary, with the Semantic Asset Bank, the Canary Islands Statistics Institute has taken a significant step towards facilitating the reuse of its semantic assets. This tool, which brings together tens of thousands of structural resources, allows easy access to an interconnected network that complies with national and European interoperability standards. 

calendar icon
Noticia

The Canary Islands Statistics Institute (ISTAC) has added  more than 500 semantic assets and more than 2100 statistical cubes to its catalogue.

This vast amount of information represents decades of work by the ISTAC in standardisation and adaptation to leading international standards, enabling better sharing of data and metadata between national and international information producers and consumers.

The increase in datasets not only quantitatively improves the directory at datos.canarias.es and datos.gob.es, but also broadens the uses it offers due to the type of information added.

New semantic assets

Semantic resources, unlike statistical resources, do not present measurable numerical data , such as unemployment data or GDP, but provide homogeneity and reproducibility.

These assets represent a step forward in interoperability, as provided for both at national level with the National Interoperability Scheme ( Article 10, semantic assets) and at European level with the European Interoperability Framework (Article 3.4, semantic interoperability). Both documents outline the need and value of using common resources for information exchange, a maxim that is being pursued at implementing in a transversal way in the Canary Islands Government. These semantic assets are already being used in the forms of the electronic headquarters and it is expected that in the future they will be the semantic assets used by the entire Canary Islands Government.

Specifically in this data load there are 4 types of semantic assets:

  • Classifications (408 loaded): Lists of codes that are used to represent the concepts associated with variables or categories that are part of standardised datasets, such as the National Classification of Economic Activities (CNAE), country classifications such as M49, or gender and age classifications.
  • Concept outlines (115 uploaded): Concepts are the definitions of the variables into which the data are disaggregated and which are finally represented by one or more classifications. They can be cross-sectional such as "Age", "Place of birth" and "Business activity" or specific to each statistical operation such as "Type of household chores" or "Consumer confidence index".
  • Topic outlines (2 uploaded): They incorporate lists of topics that may correspond to the thematic classification of statistical operations or to the INSPIRE topic register.
  • Schemes of organisations (6 uploaded): This includes outlines of entities such as organisational units, universities, maintaining agencies or data providers.

All these types of resources are part of the international SDMX (Statistical Data and Metadata Exchange) standard, which is used for the exchange of statistical data and metadata. The SDMX provides a common format and structure to facilitate interoperability between different organisations producing, publishing and using statistical data.

The Canary Islands Statistics Institute (ISTAC) has added  more than 500 semantic assets and more than 2100 statistical cubes to its catalogue.

This vast amount of information represents decades of work by the ISTAC in standardisation and adaptation to leading international standards, enabling better sharing of data and metadata between national and international information producers and consumers.

The increase in datasets not only quantitatively improves the directory at datos.canarias.es and datos.gob.es, but also broadens the uses it offers due to the type of information added.

New semantic assets

Semantic resources, unlike statistical resources, do not present measurable numerical data , such as unemployment data or GDP, but provide homogeneity and reproducibility.

These assets represent a step forward in interoperability, as provided for both at national level with the National Interoperability Scheme ( Article 10, semantic assets) and at European level with the European Interoperability Framework (Article 3.4, semantic interoperability). Both documents outline the need and value of using common resources for information exchange, a maxim that is being pursued at implementing in a transversal way in the Canary Islands Government. These semantic assets are already being used in the forms of the electronic headquarters and it is expected that in the future they will be the semantic assets used by the entire Canary Islands Government.

Specifically in this data load there are 4 types of semantic assets:

  • Classifications (408 loaded): Lists of codes that are used to represent the concepts associated with variables or categories that are part of standardised datasets, such as the National Classification of Economic Activities (CNAE), country classifications such as M49, or gender and age classifications.
  • Concept outlines (115 uploaded): Concepts are the definitions of the variables into which the data are disaggregated and which are finally represented by one or more classifications. They can be cross-sectional such as "Age", "Place of birth" and "Business activity" or specific to each statistical operation such as "Type of household chores" or "Consumer confidence index".
  • Topic outlines (2 uploaded): They incorporate lists of topics that may correspond to the thematic classification of statistical operations or to the INSPIRE topic register.
  • Schemes of organisations (6 uploaded): This includes outlines of entities such as organisational units, universities, maintaining agencies or data providers.

All these types of resources are part of the international SDMX (Statistical Data and Metadata Exchange) standard, which is used for the exchange of statistical data and metadata. The SDMX provides a common format and structure to facilitate interoperability between different organisations producing, publishing and using statistical data.

 

calendar icon
Evento

From September 25th to 27th , Madrid will be hosting the fourth edition of the Open Science Fair, an international event on open science that will bring together experts from all over the world with the aim of identifying common practices, bringing positions closer together and, in short, improving synergies between the different communities and services working in this field. 

This event is an initiative of OpenAIRE, an organisation that aims to create more open and transparent academic communication. This edition of the Open Science Fair is co-organised by the Spanish Foundation for Science and Technology (FECYT), which depends on the Ministry of Science and Innovation, and is one of the events sponsored by the Spanish Presidency of the spanish Presidency of the Council of the European Union

The current state of open science

Science is no longer the preserve of scientists. Researchers, institutions, funding agencies and scientific publishers are part of an ecosystem that carries out work with a growing resonance with the public and a greater impact on society. In addition, it is becoming increasingly common for research groups to open up to collaborations with institutions around the world. Key to making this collaboration possible is the availability of data that is open and available for reuse in research.

However, to enable international and interdisciplinary research to move forward, it is necessary to ensure interoperability between communities and services, while maintaining the capacity to support different workflows and knowledge systems. 

The objectives and programme of the Open Science Fair

In this context, the Open Science Fair 2023 is being held, with the aim of bringing together and empowering open science communities and services, identifying common practices related to open science to analyse the most suitable synergies and, ultimately sharing experiences that are developed in different parts of the world. 

The event has an interesting programme that includes keynote speeches from relevant speakers, round tables, workshops, and training sessions, as well as a demonstration session. Attendees will be able to share experiences and exchange views, which will help define the most efficient ways for communities to work together and draw up tailor-made roadmaps for the implementation of open science

This third edition of Open Science will focus on 'Open Science for Future Generations' and the main themes it will address, as highlighted on the the event's website, are:

  • Progress and reform of research evaluation and open science. Connections, barriers and the way forward.
  • Impact of artificial intelligence on open science and impact of open science on artificial intelligence.
  • Innovation and disruption in academic publishing.
  • Fair data, software and hardware.
  • Openness in research and education.
  • Public engagement and citizen science.

Open science and artificial intelligence 

The artificial intelligence is gaining momentum in academia through data analysis. By analysing large amounts of data, researchers can identify patterns and correlations that would be difficult to reach through other methods. The use of open data in open science opens up an exciting and promising future, but it is important to ensure that the benefits of artificial intelligence are available to all in a fair and equitable way. 

Given its high relevance, the Open Science Fair will host two keynote lectures and a panel discussion on 'AI with and for open science'. The combination of the benefits of open data and artificial intelligence is one of the areas with the greatest potential for significant scientific breakthroughs and, as such, will have its place at the event is one of the areas with the greatest potential for significant scientific breakthroughs and, as such, will have its place at the event. It will look from three perspectives (ethics, infrastructure and algorithms) at how artificial intelligence supports researchers and what the key ingredients are for open infrastructures to make this happen. 

The programme of the Open Science Fair 2023 also includes the presentation of a demo of a tool for mapping the research activities of the European University of Technology EUt+ by leveraging open data and natural language processing. This project includes the development of a set of data-driven tools. Demo attendees will be able to see the developed platform that integrates data from public repositories, such as European research and innovation projects from CORDIS, patents from the European Patent Office database and scientific publications from OpenAIRE. National and regional project data have also been collected from different repositories, processed and made publicly available. 

These are just some of the events that will take place within the Open Science Fair, but the full programme includes a wide range of events to explore multidisciplinary knowledge and research evaluation. 

Although registration for the event is now closed, you can keep up to date with all the latest news through the hashtag #OSFAIR2023 on Twitter, LinkedIn and Facebook, as well as on the event's website website

In addition, on the website of datos.gob.es and on our social networks you can keep up to date on the most important events in the field of open data, such as those that will take place during this autumn.

calendar icon
Application

This free software application offers a map with all the trees in the city of Barcelona geolocated by GPS. The user can access in-depth information on the subject. For example, the program identifies the number of trees in each street, their condition and even the species. 

The application's developer, Pedro López Cabanillas, has used datasets from Barcelona's open data portal (Open Data Barcelona) and states, in his blog, that it can be useful for botany students or "curious users". The Barcelona Trees application is now in its third beta version.  

The program uses the framework Qt, C++ and QML languages, and can be built (using a suitable modern compiler) for the most common targets: Windows, macOS, Linux and Android operating systems.

calendar icon
Noticia

Gaia-X represents an innovative paradigm for linking data more closely to the technological infrastructure underneath, so as to ensure the transparency, origin and functioning of these resources. This model allows us to deploy a sovereign and transparent data economy, which respects European fundamental rights, and which in Spain will take shape around the sectoral data spaces (C12.I1 and C14.I2 of the Recovery, Transformation and Resilience Plan). These data spaces will be aligned with the European regulatory framework, as well as with governance and instruments designed to ensure interoperability, and on which to articulate the sought-after single data market.

In this sense, Gaia-X interoperability nodes, or Gaia-X Digital Clearing House (GXDCH), aim to offer automatic validation services of interoperability rules to developers and participants of data spaces. The creation of such nodes was announced at the Gaia-X Summit 2022 in Paris last November. The Gaia-X architecture, promoted by the Gaia-X European Association for Data & Cloud AISBL, has established itself as a promising technological alternative for the creation of open and transparent ecosystems of data sets and services.

These ecosystems, federated by nature, will serve to develop the data economy at scale. But in order to do so, a set of minimum rules must be complied with to ensure interoperability between participants. Compliance with these rules is precisely the function of the GXDCH, serving as an "anchor" to deploy certified market services. Therefore, the creation of such a node in Spain is a crucial element for the deployment of federated data spaces at national level, which will stimulate development and innovation around data in an environment of respect for data sovereignty, privacy, transparency and fair competition.

The GXDCH is defined as a node where operational services of an ecosystem compliant with the Gaia-X interoperability rules are provided. Operational services" should be understood as services that are necessary for the operation of a data space, but are not in themselves data sharing services, data exploitation applications or cloud infrastructures. Gaia-X defines six operational services, of which at least two must be part of the mandatory nodes hosting the GXDCHs:

Mandatory services

  • Gaia-X Registry: Defined as an immutable, non-repudiable, distributed database with code execution capabilities. Typically it would be a blockchain infrastructure supporting a decentralised identity service ('Self Sovereign Identity') in which, among others, the list of Trust Anchors or other data necessary for the operation of identity management in Gaia-X is stored.
  • Gaia-X Compliance Service or Gaia-X Compliance Service: Belongs to the so-called Gaia-X Federation Services and its function is to verify compliance with the minimum interoperability rules defined by the Gaia-X Association (e.g. the Trust Framework).

Optional services

  • Self-Descriptions (SDs) or Wizard Edition Service: SDs are verifiable credentials according to the standard defined by the W3C by means of which both the participants of a Gaia-X ecosystem and the products made available by the providers describe themselves. The aforementioned compliance service consists of validating that the SDs comply with the interoperability standards. The Wizard is a convenience service for the creation of Self-Descriptions according to pre-defined schemas.
  • Catalogue: Storage service of the service offer available in the ecosystem for consultation.
  • e-Wallet: For the management of verifiable credentials (SDs) by participants in a system based on distributed identities.
  • Notary Service: Service for issuing verifiable credentials signed by accreditation authorities (Trust Anchors).

What is the Gaia-X Compliance Service (i.e. Compliance Service)?

The Gaia-X Compliance Service belongs to the so-called Gaia-X Federation Services and its function is to verify compliance with the minimum interoperability rules defined by the Gaia-X Association. Gaia-X calls these minimum interoperability rules (Trust Framework). It should be noted that the establishment of the Trust Framework is one of the differentiating contributions of the Gaia-X technology framework compared to other solutions on the market. But the objective is not just to establish interoperability standards, but to create a service that is operable and, as far as possible, automated, that validates compliance with the Trust Framework. This service is the Gaia-X Compliance Service.

The key element of these rules are the so-called "Self-Descriptions" (SDs). SDs are verifiable credentials according to the standard defined by the W3C by which both the participants of a data space and the products made available by the providers describe themselves. The Gaia-X Compliance service validates compliance with the Trust Framework by checking the SDs from the following points of view:

  • Format and syntax of the SDs
  • Validation of the SDs schemas (vocabulary and ontology)
  • Validation of the cryptography of the signatures of the issuers of the SDs
  • Attribute consistency
  • Attribute value veracity.

Once the Self-Descriptions have been validated, the compliance service operator issues a verifiable credential that attests to compliance with interoperability standards, providing confidence to ecosystem participants. Gaia-X AISBL provides the necessary code to implement the Compliance Service and authorises the provision of the service to trusted entities, but does not directly operate the service and therefore requires the existence of partners to carry out this task.

 

 

calendar icon