Perhaps one of the most everyday uses of artificial intelligence in our daily lives is interaction with computer vision and object identification systems, from unlocking our smartphone to searching for images on the Internet. All these functionalities are possible thanks to artificial intelligence models in the field of image detection and classification. In this post we compile some of the most important open image repositories, which have made it possible to train current image recognition models.
Introduction
Let's go back for a moment to late 2017, early 2018. The possibility of unlocking our smartphones with some kind of fingerprint reader had become widespread. With more or less success, most manufacturers had managed to include the biometric reader in their terminals. The unlocking time, the ease of use and the extra security provided were exceptional compared to the classic systems of passwords, patterns, etc. As has been the case since 2008, the undisputed leader in digital innovation in mobile terminals, Apple, revolutionised the market once again by incorporating an innovative unlocking system in the iPhone X that uses an image of our face. The so-called FaceID system scans our face to unlock the terminal in tenths of a second without having to use our hands. The probability of identity theft with this system was 1 in 1,000,000, making it 20 times more secure than its predecessor, TouchID.
Let this little story about an everyday feature serve to introduce an important topic in the field of artificial intelligence, and in particular in the field of computer image processing: image repositories for training AI models. We have talked a lot in this space about this field of artificial intelligence. A few months after the launch of FaceID, we published a post on AI in which we mentioned near-human-level image classification as one of the most important achievements of AI in recent years. This would not be possible without the availability of open banks of annotated images[1] to train image recognition and classification models. In this post we list some of the most important (freely available) image repositories for model training.
Of course, recognising the number plate of a vehicle at the entrance to a car park is not the same as identifying a lung disease in an X-ray image. The banks of annotated images are as varied as the potential AI applications they enable.
Probably the two best-known image repositories are MNIST and ImageNet.
- MNIST is a set of 70,000 black and white images of handwritten digits, normalised in size and ready to train digit recognition algorithms. Professor LeCun's original paper dates from 1998.
- ImageNet is a huge database of concepts (words or sets of words). Each concept with its own meaning is called a synset. Each synset is represented by hundreds or thousands of images. ImageNet's own website describes the project as an indispensable tool for the recent advancement of deep learning and computer vision:
"The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use."
The most widely used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, an image classification and localisation dataset. This subset was used from 2010 to 2017 in the worldwide object detection and image classification competitions. It covers 1,000 object classes and contains more than one million training images, 50,000 validation images and 100,000 test images. This subset is available on Kaggle.
In addition to these two classic repositories that are already part of the history of image processing by artificial intelligence, we have some more current and varied thematic repositories. Here are some examples:
- The very annoying CAPTCHAs and reCAPTCHAs that we find on a multitude of websites, which verify that it is a human trying to gain access, are a good example of artificial intelligence applied to the field of security. Of course, CAPTCHAs also need their own repository to check how effective they are in preventing unwanted access. We recommend reading this interesting article about the history of these web browsing companions.
- As we have seen several times in the past, one of the most promising applications of AI in the field of imaging is to assist physicians in diagnosing diseases from a medical imaging test (X-ray, CT scan, etc.). To make this a reality, there is no shortage of efforts to collect, annotate and make available to the research community repositories of quality, anonymised medical images to train models for detecting objects, shapes and patterns that may reveal a possible disease. Breast cancer accounts for 30% of all cancers in women worldwide, hence the importance of having image banks that facilitate the training of specific models.
- The diagnosis of blood-based diseases often involves the identification and characterisation of patient blood samples. Automated methods (using medical imaging) to detect and classify blood cell subtypes have important medical applications.
- Three years ago, COVID-19 burst into our lives, turning developed societies upside down with a global pandemic that had terrible consequences in terms of human and economic loss. The entire scientific community threw itself into finding a solution in record time to tackle the consequences of the new coronavirus. Many efforts were made to improve the diagnosis of the disease, and some techniques relied on AI-assisted image analysis. At the same time, health authorities introduced a new element into our daily routine: face masks. Even today, in some situations the mask is still mandatory, and during these three years its proper use has had to be monitored in almost all kinds of places. So much so that in recent months there has been a proliferation of specific image banks to train AI and computer vision models to detect the use of masks autonomously.
- For more information on open repositories related to health and wellbeing, we leave you with this post we published a few months ago.
In addition to these curious examples cited in this post, we encourage you to explore Kaggle's section of datasets that include images as data. You only have 10,000 sets to browse through ;)
[1] Annotated image repositories contain, in addition to the image files (jpeg, tiff, etc.), descriptive files with metadata identifying each image. Typically, these files (csv, JSON or XML) include a unique identifier for each image as well as fields that provide information about the content of the image. For example, the name of the object that appears in the image.
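As an illustration of the annotation files described in this note, the following minimal Python sketch reads a small CSV annotation file and groups the image files by the object they show. The column names (image_id, file_name, label) and the sample rows are hypothetical; each real repository defines its own layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical annotation file: one row per image, with a unique identifier,
# the image file name and the name of the object that appears in it.
SAMPLE_CSV = """image_id,file_name,label
0001,img_0001.jpeg,cat
0002,img_0002.jpeg,dog
0003,img_0003.jpeg,cat
"""


def load_annotations(csv_text):
    """Return a dict mapping each image_id to its annotation row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["image_id"]: row for row in reader}


def files_by_label(annotations):
    """Group image file names by their annotated label."""
    groups = defaultdict(list)
    for row in annotations.values():
        groups[row["label"]].append(row["file_name"])
    return dict(groups)


annotations = load_annotations(SAMPLE_CSV)
label_index = files_by_label(annotations)
```

With a real repository, the text passed to load_annotations would come from the descriptive file distributed alongside the images, and the resulting index would be used to pair each image with its training label.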
Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation.
The contents and views expressed in this publication are the sole responsibility of the author.
In recent times, open data has become an element of great value when it comes to improving quality of life and offering greater benefits to citizens in different sectors. One of them is tourism, where the number of public administrations opening their data is increasing.
One of the main reasons is found in the great economic benefit that this sector brings to a country like Spain, which welcomes millions of tourists every year. Therefore, it is not surprising that municipalities and administrations show increasing interest in disseminating the services they offer in order to attract as many visitors as possible.
The data related to the tourism sector is highly dynamic and for this reason there are many organizations that are committed to offering it through APIs, which facilitate access in a much more efficient way.
The opening of data in the tourism sector is a practice that encourages the creation of services and technologies capable of offering solutions to current problems through the reuse of open data. This is the case of applications such as Casual Learn, which uses information from the Open Data Portal of Castilla y León so that its users can learn art history while touring the community's monuments. Another example is Maps of Spain, a free IGN viewer aimed at citizens who want to carry out activities in nature, which they can access from their mobile phone without needing an internet connection.
If you are interested in accessing this type of data, we have collected below 10 examples of repositories related to tourism at an international level, divided into three categories: tourism, leisure and culture, and meteorology.
Tourism
DATA Tourisme
Tourism Information & Service Hub (TIH)
- Publisher: Government of Singapore
The Tourism Information & Service Hub (TIH) is a digital resource platform that enables businesses and developers to access relevant information on Singapore's tourism offerings and travel software services.
Undoubtedly, the highlight of this portal is that it has an API to facilitate access to its data offer. The data APIs allow developers to access datasets related to Singapore tourism through an API key.
Accommodation, attractions, excursions, shopping centers and stores or number of visitors are just some examples of the type of data that can be found on this portal.
My Switzerland
- Publisher: Government of Switzerland
This platform offers datasets related to tourism in Switzerland, provided through an API. It is a public API that presents tourist information translated into 16 languages, and its main source of content is the My Switzerland portal.
Currently this API provides data about tourist destinations, attractions and offers of interest, although this list will be expanded in the near future with more types of data depending on the needs of partners and reusers.
Places API
- Publisher: Google
This API developed by Google allows you to search for information on more than 200 million places through a wide variety of categories, including establishments, prominent points of interest or geographical locations.
Through this API, developers can access a wide variety of Google data to provide their users with a real-time location-based experience by displaying place names and information rather than a set of coordinates.
Leisure and culture
UK Natural History Museum
- Publisher: UK Natural History Museum
Through this portal it is possible to consult and download data about the museum's research and collections. It currently has approximately 200 datasets on various topics such as entomology, zoology, botany or paleontology, among others.
All datasets are available through an API to facilitate downloading for users who wish to use the data in their own software or applications.
European Group on Museum Statistics (EGMUS)
- Publisher: European Group on Museum Statistics (EGMUS)
The European Group on Museum Statistics (EGMUS) is an organization founded in 2002 in which 30 European countries are represented. The main objective of EGMUS is the collection and publication of statistical data relating to the participating European museums.
Information available from national museum statistics and surveys is collected, updated and stored in the Abridged List of Key Museum Indicators (ALOKMI) table. ALOKMI is the first step towards the harmonization of museum statistics in Europe.
The data tables offered by EGMUS are available for download in CSV format.
IMAGES D’ART
- Publisher: Réunion des musées nationaux - Grand Palais
Images d'Art (Art Images) is a platform that offers an extensive database of hundreds of thousands of works by approximately 30,000 artists. This image database contains works from French museums that have been digitized and documented by the photography agency RMN-GP.
In this portal we can filter the information by parameters such as museum, historical period, author, technique or keywords, or use the advanced search.
Europeana
- Publisher: Europeana
Europeana is a portal that provides cultural heritage enthusiasts, practitioners, teachers and researchers with digital access to European cultural heritage material. This platform has information on more than 3,700 different institutions. A network of aggregator partners collects the data, thoroughly checks it, and enriches it with information such as geographic location or links it to other materials or data sets through associated people, places, or topics.
Europeana offers data on works of art, books, music and videos, newspapers, archaeology, fashion, science or sports, among many others. To facilitate access to this information, this portal has different APIs.
World Digital Library
- Publisher: World Digital Library (WDL)
The World Digital Library was a project created in 2009 by the United States Library of Congress, with the support of UNESCO and contributions from libraries, archives, museums, educational institutions, and international organizations around the world.
The WDL contains extremely interesting materials that are essential for understanding cultures around the world. The data it offers is available free of charge and in a wide variety of languages. In addition, it offers a menu that allows you to filter the data by format, date, location, theme or language, among others.
Meteorology
Open Meteo
- Publisher: Open Meteo
Open-Meteo offers a free weather data API for global weather forecasting. This API is especially aimed at open source developers and non-commercial use. No API key is required to access it, and its information is updated every 3 hours.
Data related to temperature, wind, pressure, humidity or precipitation are just some of the meteorological variables that users have available through this API.
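As a rough sketch of how such an API can be queried, the following Python example builds a forecast request URL and, optionally, downloads the JSON response. The endpoint and the parameter names (latitude, longitude, hourly, temperature_2m, precipitation) are assumptions that should be checked against the Open-Meteo documentation, and the coordinates are just an example location.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify it in the Open-Meteo documentation before use.
BASE_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude, longitude, hourly_variables):
    """Build a forecast request URL for a location and a list of variables."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Several variables are sent as one comma-separated value.
        "hourly": ",".join(hourly_variables),
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_forecast(url, timeout=10):
    """Download and decode the JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example coordinates (Madrid) and assumed variable names.
url = build_forecast_url(40.4168, -3.7038, ["temperature_2m", "precipitation"])
```

Only the URL is built when the example runs; calling fetch_forecast(url) would perform the actual request. No key or password appears among the parameters, in line with the description above.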
This has been just a small selection of data repositories related to the tourism sector that could be of interest to you. Do you know of any other relevant repositories in this field? Leave us a comment or send us an email at dinamizacion@datos.gob.es
Research data is very valuable, and permanent access to it is one of the greatest challenges for all agents involved in the scientific world: research staff, funding agencies, publishers and academic institutions. The long-term conservation of data and the culture of open access are sources of new opportunities for the scientific community. More and more universities and research centers offer repositories with their research data, allowing permanent access to them. As a result of the requirements of each academic discipline, the existing repositories are very varied.
Research staff face this universe of multiple repositories, tools and formats every day, and finding the desired data without a guide takes a great deal of time and effort. Re3data.org is an international registry of research data repositories (Registry of Research Data Repositories) that collects metadata from repositories specialized in storing research data. Thanks to this compilation work, research staff, funding organizations, libraries and publishers can search and browse the main research data repositories, with searches and faceted views by discipline, subject, country, content, format, license, language, etc.
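To illustrate what a faceted view over repository metadata means in practice, here is a minimal Python sketch. The records and field names are hypothetical and far simpler than the real registry schema: the idea is only to count how many repositories fall under each value of a facet and to filter by one of those values.

```python
from collections import Counter

# Hypothetical repository records; the real registry metadata has many more fields.
REPOSITORIES = [
    {"name": "Repo A", "country": "ESP", "subjects": ["Life Sciences"]},
    {"name": "Repo B", "country": "DEU", "subjects": ["Life Sciences", "Chemistry"]},
    {"name": "Repo C", "country": "ESP", "subjects": ["Humanities"]},
]


def field_values(record, field):
    """Return the values of a field as a list, whether it is single or multi-valued."""
    value = record[field]
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of the given facet."""
    counts = Counter()
    for record in records:
        counts.update(field_values(record, field))
    return counts


def filter_by(records, field, wanted):
    """Keep only the records whose facet contains the wanted value."""
    return [r for r in records if wanted in field_values(r, field)]


country_facet = facet_counts(REPOSITORIES, "country")
life_sciences = [r["name"] for r in filter_by(REPOSITORIES, "subjects", "Life Sciences")]
```

Combining several filter_by calls narrows the list step by step, which is how a search by discipline, country or license can be refined in a faceted interface.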
The re3data.org registry was born as a joint project of several German organizations, funded by the German Research Foundation (DFG). The official launch took place in May 2013 and the DataBib catalog was subsequently integrated to avoid duplication and confusion due to the existence of two similar parallel registers. The unification project was sponsored by DataCite, an international non-profit organization whose goal is to improve the quality of data citations. In addition, re3data.org collaborates with other Open Science projects such as BioSharing or OpenAIRE.
Multiple publishers, research institutions and funding organizations refer to the re3data.org registry in their editorial policies or guidelines as the ideal tool for identifying data repositories. One of the most notable examples is the European Commission (together with Nature and Springer), which mentions it in the document "Guidelines to the Rules on Open Access to Scientific Publications and Open Access to Research Data in Horizon 2020".
Currently, the metadata of the repositories stored are those listed in version 3 of the Metadata Schema for the Description of Research Data Repositories.
The registry identifies and lists nearly 2,000 repositories of research data, which makes re3data.org the largest and most complete registry of data repositories available on the web. Its growth has been constant since its launch, covering a wide range of disciplines.
As regards Spain, and as of December 1, 2017, 23 repositories of research data are cataloged in which Spain participates.
The promotion of open science, the culture of exchange, the reuse of information and open access lie at the foundations of the re3data.org project. On these solid foundations, the tool continues to increase the metadata it collects, and therefore the visibility of research data. Continuing to work on increasing this visibility and enhancing open science is not only essential to guarantee research work that builds on previous milestones, but also allows us to exponentially expand the horizons of scientific work.
Beyond the data of public administrations, libraries, museums and cultural foundations, the interest in open data knows no borders. We invite you to discover it in this post.
Normally, the concept of open data is associated with repositories managed by public administrations, foundations and cultural organizations such as libraries and museums. But open data includes much more and, if we search thoroughly, we can find real jewels waiting to be explored. Often they are repositories on specific topics, very useful for professionals who work in that field. Others are general repositories with unusual datasets.
Let's see some examples.
Open data and science
To illustrate the specific data repositories, let's focus on two examples from the scientific field:
1) Open data portal of the European Space Agency. On this website we can find a large number of images and data from the different space missions of the European Space Agency (ESA). For example, most satellite images from the Copernicus program, the most ambitious Earth observation program to date, provide accurate, timely and easily accessible information to improve environmental management, understand and mitigate the effects of climate change and ensure civil security.
Example of an open image under CC BY-SA 3.0 IGO license from the ESA open data repository, in particular from the Copernicus ground observation program.
ESA makes available not only images and videos from satellites, but also a large amount of observation data that can be processed to generate our own images or analyses. As an example, the data generated by the Gaia mission, the most ambitious mission to draw a three-dimensional map of our Galaxy, is available for direct download at this link. Browsing the links that hang from the main repository, we can access files in .csv format of several tens of MB, ready for analysis.
2) CERN open data portal. CERN is the European laboratory for nuclear research. The place where the World Wide Web was born, it concentrates a good part of the best scientific talent in Europe and generates several dozen petabytes of data per year. CERN also has its own website dedicated to open data. The CERN open data site is very user-friendly for the non-expert user and proposes different ways of approaching the data stored there. There are different paths to explore the site depending on whether we follow the Learn, Visualize or Analyze path. This website is a treasure trove of data, but basic (or not so basic) notions of particle physics are needed to exploit its full potential.
In addition to the core site, CERN makes a GitHub site available to (advanced) users, so that developers who want to work with open data have a more suitable environment for exploiting the data programmatically. GitHub sites and other open source repositories foster the development of collaborative user communities around open data.
Very, very diverse data
But in addition to these specific repositories, there are also general-purpose repositories with unusual datasets. We have already spoken on previous occasions about the Kaggle website. Kaggle is an open web platform aimed at data scientists in which challenges are posed (some of them with high cash prizes). This time we approach Kaggle only to explore its extensive catalog of data (mostly published under a Creative Commons license in one of its variants).
To cite some varied examples, among the first entries of its catalog we find datasets on the height of the waves on the Australian coast or, for example, a dataset that includes a list of 10,000 women's shoes with their prices, published under a CC BY-NC-SA 4.0 license. This list could not leave out one of the most popular and widely used datasets today. Every quarter, Stack Overflow, the largest online community for programmers, publishes an extraction of its database with the posts, votes, tags and comments that have passed through its platform. The analysis of this dataset (published under CC BY-SA 3.0), which is more than 100 GB in volume, is probably the most accurate way to measure market trends in terms of the popularity and use of existing programming languages.
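As a small illustration of how tag data of this kind could be used to gauge the popularity of programming languages, the following Python sketch counts how often each tag appears in a handful of posts. The sample strings are invented, and the "<tag1><tag2>" layout is an assumption about how tags are stored; the real extraction should be checked before reusing the pattern.

```python
import re
from collections import Counter

# Invented sample of per-post tag strings, assuming a "<tag1><tag2>" layout.
SAMPLE_TAGS = [
    "<python><pandas>",
    "<javascript><reactjs>",
    "<python><numpy>",
    "<java>",
]

# Captures the text between each pair of angle brackets.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def count_tags(tag_strings):
    """Count the occurrences of every tag across all posts."""
    counts = Counter()
    for tags in tag_strings:
        counts.update(TAG_PATTERN.findall(tags))
    return counts


tag_counts = count_tags(SAMPLE_TAGS)
most_used = tag_counts.most_common(1)
```

On the full dataset, the same counting could be repeated per quarter to see how the share of each language tag evolves over time.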
In short, in addition to the existing data sets on mobility, environment, location of basic services in cities or cultural collections, there are open data repositories, much more specific, for those intrepid users who dare to investigate in search of less common data. Of course, the future of open data has no borders.
Content prepared by Alejandro Alija, expert in Digital Transformation and innovation.
Contents and points of view expressed in this publication are the exclusive responsibility of its author.