Report published: “Study on reusable data as linguistic resources”


News date: 08-10-2019

Language Technologies are a field of Artificial Intelligence that makes it possible, among other things, to automatically exploit the large amount of textual and oral information we have access to. To boost their development, the Secretary of State for Digital Advancement (SEAD) is carrying out the Plan for the Advancement of Language Technologies (TL Plan), focused on Natural Language Processing (PLN), Machine Translation (TA) and Conversational Systems (SSCC) in Spain, especially in Spanish and the co-official languages.

One of the activities included in the TL Plan is the report “Study on reusable data as linguistic resources”, carried out by Antonio Moreno Sandoval (Director of the Laboratory of Computer Linguistics, Autonomous University of Madrid), Doroteo Torre Toledano (researcher in Artificial Intelligence applied to speech processing and professor at the Autonomous University of Madrid) and Ana Valverde (associate professor at the University of Castilla-La Mancha), and funded by SEAD and Red.es. The purpose of the report is twofold:

On the one hand, to take a census of the resources held by the different public administrations that can be converted into linguistic resources (LR).
On the other, to propose an action plan for converting them into LR.

When we speak of a Linguistic Resource (LR), we mean any electronic file that has been processed to serve as source, training or evaluation material for a language technology system. That is, in order to convert open data into a linguistic resource, it is necessary to collect the data and adapt them to formats that are likely to be used by Language Technology applications.

What is the maturity status of the resources analysed?

The report analyses resources and public data sets that could become linguistic resources. Of the 101 resources analysed, 24 sets of resources were selected for detailed analysis and evaluation based on criteria of interest (data quality, quantity and availability), multilingualism, intellectual property, thematic variety, degree of maturity and typology.

The selected sets address a series of thematic areas defined as priorities in the study: Competitive Intelligence, Health, Justice and Culture (there is also an “Others” section). They cover material as diverse as RTVE's on-demand video recordings, the INE's list of municipalities and OrphaData's database of rare diseases and medicines.

The degree of maturity of each set has been calculated based on technical and legal aspects. This analysis has determined that the majority of the sets analysed have low or medium maturity. This is because the requirements to be considered a mature resource are very strict: the data must already be processed and in formats that PLN researchers can use directly.

However, the current state of the open data that can become LR is very promising, since there is a large collection of scanned documents with enough potential to become LR.

What is the Spanish situation compared to other international initiatives?

In order to better understand Spain's maturity status in this area, it was necessary to compare its LR and TL advancement initiatives with those of other countries in Latin America and Europe, and of major powers such as the United States or Canada. The report also highlights the European Language Resource Coordination portal, which provides a list of open resources in Europe.

After the analysis, it was concluded that France and the United Kingdom are the benchmarks at European level. However, the European Commission has warned of Europe's lag behind English and American companies, which have capitalized on the use of linguistic Big Data, allowing large technology multinationals such as Google, Microsoft, Amazon or IBM to offer linguistic services in very different domains and languages.

Spain is well positioned, although it should be noted that its maturity status is higher in open data than in open LR. That is to say, we find numerous open data sets, of different types and sizes and in different reusable formats, but not as many linguistic resources ready for use in TL as in other countries.

Recommendations for the development of an action plan

In short, the PLN research community in Spain needs high-quality LR in sufficient quantity to develop applications that are competitive in the international market. The conversion to LR must therefore follow international technical standards that ensure interoperability between data and processors, for example by segmenting text into units or by using a standard language code.
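
The two interoperability examples mentioned above, segmenting text into units and using a standard language code, could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the report does not prescribe this method, the function names `segment` and `to_records` are hypothetical, the sentence splitter is a deliberately naive rule, and the two-letter code `es` is assumed to be an ISO 639-1 language code.

```python
import json
import re


def segment(text):
    """Naively split text into sentences after '.', '!' or '?' plus whitespace.

    Assumption: real segmenters handle abbreviations, numbers, etc.; this
    rule does not.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_records(text, lang):
    """Attach a numeric id and a language code to each segment."""
    return [
        {"id": i, "lang": lang, "text": sentence}
        for i, sentence in enumerate(segment(text), start=1)
    ]


# "es" is assumed here to be the ISO 639-1 code for Spanish.
records = to_records("Los datos son abiertos. ¿Son reutilizables? Sí.", "es")
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Storing the language code next to every unit means that any tool reading the records can select or filter segments by language without guessing.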

The report also includes a specification of recommended formats by type of resource, and of useful tools for preparing data as LR, such as segmenters, tokenizers, anonymizers, annotators and entity recognizers, among others.

Resource type	Recommended format	Other acceptable formats
Textual corpus	Annotation in XML or TXT in UTF-8 encoding	JSON, CSV; PDF is not convenient
Audio corpus	WAV, 16 bits, 16 kHz (voice) or 44.1 kHz (music, audio)	FLAC; MP3 (high quality); other convertible formats (with possible loss of quality)
Video corpus	High-quality MPEG-4 (MP4)	H.264; any other convertible high-quality format
Translation memories corpus	TMX	CSV
Named entities and lexical resources	Annotation in XML or TXT in UTF-8 encoding	JSON, CSV, RDF
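
As an illustration of the audio row in the table above, the following Python sketch checks whether a WAV file uses 16-bit samples at a given sample rate. It relies only on the standard `wave` module; the function names `meets_voice_spec` and `make_silent_wav` are hypothetical helpers introduced for this example and do not come from the report.

```python
import io
import wave


def meets_voice_spec(wav_file, sample_rate=16000, sample_width_bytes=2):
    """Return True if the WAV has the expected sample rate and sample width.

    Defaults correspond to 16 kHz and 16 bits (2 bytes) per sample.
    """
    with wave.open(wav_file, "rb") as w:
        return (
            w.getframerate() == sample_rate
            and w.getsampwidth() == sample_width_bytes
        )


def make_silent_wav(sample_rate, seconds=1):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * sample_rate * seconds)
    buffer.seek(0)
    return buffer


print(meets_voice_spec(make_silent_wav(16000)))
print(meets_voice_spec(make_silent_wav(44100)))
```

A path to a file on disk can be passed to `meets_voice_spec` in place of the in-memory buffer; for music or general audio, the same check can be run with `sample_rate=44100`.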

Likewise, a series of recommendations are provided for preparing the data as LR. These recommendations are:

Guarantee the availability of, and universal access to, open data for LR in all the languages of the State through a single common portal.
Promote the conversion of millions of pages scanned as images into PDF or plain text.
Improve the visibility of data sets in terms of availability and maturity.
Facilitate the massive download of large files in appropriate formats (plain text, XML, CSV, JSON, RDF).
Create annotated resources of general utility and open availability.
Provide tools for converting open data into LR.
Provide transcription of multimedia files.
Stimulate the adoption of free-use licenses and access to data.
Stimulate the reuse of data by organizing technological competitions based on them.
Facilitate access to computing and storage capacity for large volumes of data.
Promote the publication of anonymized data sets, essential in medical or legal documents.
In translation resources, identify the source and target languages, as well as the alignment of the “translation units”.
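
To make the last recommendation more concrete, the sketch below builds a small TMX document in Python in which each translation unit (`tu`) holds an aligned source and target segment, each labelled with its language. It is an illustrative sketch rather than a method taken from the report: the function name `build_tmx`, the header values such as `example-script`, and the sample sentence pair are all assumptions, and the element and attribute names follow TMX 1.4 as commonly documented.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX tree with one translation unit per aligned sentence pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",      # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,               # source language declared once
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")     # one aligned translation unit
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Made-up example pair: Spanish source, English target.
tmx = build_tmx([("Datos abiertos", "Open data")], "es", "en")
xml_text = ET.tostring(tmx, encoding="unicode")
print(xml_text)
```

Keeping both language variants inside the same `tu` element is what records the alignment: a tool reading the file never has to match source and target sentences by position across separate files.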