The RAE's data corpora for training artificial intelligence models

News date: 14-04-2025

A corpus, in linguistic terms, is a structured set of texts or linguistic data used to analyse and study language use. These texts can be written or oral. Corpora are essential for understanding linguistic patterns, studying language variation and, in the case of artificial intelligence, training natural language processing (NLP) models. These models need large volumes of data to learn to understand, generate and analyse human language. Thanks to corpora, artificial intelligence systems can identify grammatical structures, analyse the context of words and improve their ability to interact with users.

The Royal Spanish Academy (RAE) has developed several language corpora of great value for researchers, linguists and now also for artificial intelligence developers. The most notable are CREA (Corpus de Referencia del Español Actual), CORPES XXI, CDH (Corpus del Diccionario Histórico de la lengua española) and CORDE (Corpus Diacrónico del Español). In this post, we look at the content of each of them. Beyond their linguistic and cultural value, all of them are open access.

CREA: Corpus de Referencia del Español Actual

CREA is a collection of texts of diverse provenance that allows the study of words, their meanings and their contexts. It contains more than 160 million forms (elements), drawn from texts produced between 1975 and 2004 in all Spanish-speaking countries.

The CREA corpus is organised according to four main criteria:

Medium: 49% comes from books, 49% from press and 2% from miscellaneous material.
Chronological: texts classified in five-year periods (1975-1979, 1980-1984, etc.), with greater weight in the most recent sections.
Geographical: 50% from Spain and 50% from America, distributed in traditional linguistic areas (Andean, Caribbean, Mexico and Central America, etc.).
Thematic: organised into six broad thematic areas subdivided into more specific topics.

Beyond its use for training AI models, CREA is also used to build dictionaries, conduct research and develop technological tools such as spell checkers, machine translation systems, thesauri and other writing assistance tools.

The corpus is available in separate oral and written versions (CREA oral and CREA written). In addition, the annotated CREA version 1.0 was published in December 2023; it allows searching by form, lemma and grammatical category. This version contains more than 111,000 documents totalling more than 122.5 million forms.
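
To illustrate what searching by form, lemma and grammatical category means in practice, here is a minimal sketch in Python. It does not use the RAE's query system or data format: the `Token` structure, the `search` function, the tag names and the sample sentence are all hypothetical and serve only to show why annotation makes such queries possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word exactly as it appears in the text
    lemma: str  # the dictionary headword the form belongs to
    pos: str    # grammatical category (hypothetical tag set)


# Hypothetical annotated sample, not taken from any RAE corpus.
SAMPLE = [
    Token("Las", "el", "DET"),
    Token("casas", "casa", "NOUN"),
    Token("eran", "ser", "VERB"),
    Token("blancas", "blanco", "ADJ"),
    Token("y", "y", "CONJ"),
    Token("la", "el", "DET"),
    Token("casa", "casa", "NOUN"),
    Token("es", "ser", "VERB"),
    Token("grande", "grande", "ADJ"),
]


def search(tokens, form=None, lemma=None, pos=None):
    """Return the tokens that match every criterion that was given."""
    results = []
    for token in tokens:
        if form is not None and token.form.lower() != form.lower():
            continue
        if lemma is not None and token.lemma != lemma:
            continue
        if pos is not None and token.pos != pos:
            continue
        results.append(token)
    return results


# A lemma query finds every inflected form of the same word.
print([t.form for t in search(SAMPLE, lemma="ser")])
```

A search by form alone would treat "eran" and "es" as unrelated strings; the lemma and category layers are what let a query, or a model trained on the annotated text, group them together.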

CORPES XXI: Corpus of 21st Century Spanish

Like CREA, CORPES XXI is a reference corpus that serves to establish the meaning and characteristics of words, expressions and constructions from actual recorded uses. The current version (1.2), released in November 2024, contains:

More than 390,000 written texts.
Almost 1,000 texts from oral transcriptions.
A total of almost 425 million forms.

In addition, CORPES XXI includes a comprehensive query system that allows you to search for specific words, expressions or grammatical categories. April 2024 saw the addition of a lexical frequency dictionary, an invaluable tool for AI researchers and developers.
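
As a rough idea of what a lexical frequency dictionary records, the following Python sketch counts forms in a text and normalises each count to occurrences per million forms. It is an illustration under simple assumptions (lowercasing, a basic regular-expression tokeniser, a hypothetical `frequency_table` helper) and is not the method the RAE uses to build its dictionary.

```python
import re
from collections import Counter


def frequency_table(text):
    """Map each form to (absolute count, occurrences per million forms)."""
    # Simplistic tokenisation: lowercase and keep runs of word characters.
    forms = re.findall(r"\w+", text.lower())
    counts = Counter(forms)
    total = len(forms)
    return {
        form: (count, count * 1_000_000 / total)
        for form, count in counts.items()
    }


table = frequency_table("La casa es la casa")
print(table["casa"])  # (2, 400000.0)
```

Normalising per million forms is a common convention because it makes frequencies comparable between corpora of different sizes.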

CDH: Corpus del Diccionario Histórico de la lengua española

The CDH is a fundamental resource for diachronic studies (studies of how the language changes over time), with 355 million records distributed across three query layers:

CDH nuclear: contains more than 53 million items (32 million from texts from Spain and more than 20 million from works from the Americas). The texts have been subjected to a semi-automatic process of linguistic annotation and lemmatisation.
S. XII-1975: compiles texts dated between the 12th century and 1975, drawn from a selection of CORDE works totalling 199 million forms. These texts have been given a morphosyntactic pre-annotation using free software tools.
1975-2000: includes works dated between 1975 and 2000, with titles from CREA, linguistically annotated by the Technology Department of the RAE (103 million records).
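
As a quick check, the three layer sizes quoted above add up to the 355 million total. The short Python sketch below does the arithmetic and shows each layer's approximate share; the figures are the rounded values from this article, in millions, and the variable names are only illustrative.

```python
# Layer sizes in millions, rounded as quoted in the article.
layers = {
    "CDH nuclear": 53,
    "S. XII-1975": 199,
    "1975-2000": 103,
}

total = sum(layers.values())
print(total)  # 355

# Approximate share of each layer in the whole CDH.
shares = {name: round(100 * size / total, 1) for name, size in layers.items()}
print(shares)  # {'CDH nuclear': 14.9, 'S. XII-1975': 56.1, '1975-2000': 29.0}
```

More than half of the CDH therefore comes from the pre-1975 layer, which is worth bearing in mind when using it as training data alongside contemporary corpora.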

CORDE: Diachronic Corpus of Spanish

CORDE is a textual corpus that covers all the periods and places in which Spanish was spoken, from the beginnings of the language until 1974. Designed for extracting information to study words, meanings and grammar over time, it has 250 million records from written texts of various genres.

The texts are in prose and verse, including narrative, lyrical, dramatic, scientific-technical, historical, legal, religious and journalistic texts. The CORDE includes all the geographical, historical and generic varieties in order to be sufficiently representative, making it an obligatory source for any diachronic study of Spanish.

Applications and advantages of the RAE's linguistic corpora

These corpora represent valuable resources for training artificial intelligence models for several reasons:

Linguistic representativeness: they cover all the varieties of Spanish, making it possible to develop systems that understand the regional and historical particularities of the language.
Morphosyntactic annotation: the annotated versions allow models to be trained with greater grammatical accuracy, by identifying not only words but also their grammatical categories.
Diachronic analysis: the combination of contemporary corpora (CREA, CORPES XXI) with historical corpora (CORDE, CDH) makes it possible to study the evolution of language, a fundamental aspect for models that have to process texts from different periods.
Lexical richness: the inclusion of the Fichero General (General File) provides a treasure trove of lexical and lexicographical information that is difficult to find in other resources.

In addition, this set of RAE corpora offers significant advantages for the development of language models in Spanish, such as:

Complete temporal coverage: from the origins of the language to the present day, enabling historically sensitive model training.
Geographic diversity: representation of all variants of Spanish, essential for creating systems that work well in any Spanish-speaking country.
Philological quality: the texts have been selected and processed with academic rigour, guaranteeing the quality of the training data.
Linguistic annotation: morphosyntactic tagging facilitates the training of models that understand the grammatical structure of Spanish.

Thanks to corpora such as those of the RAE, artificial intelligence models can improve their understanding of Spanish, adapt to regional differences and provide more accurate and natural responses. In this way, human language and technology are connected, expanding the possibilities for interaction and mutual learning.

For more information or to access these resources, interested parties may consult the RAE website or write directly to corpus@rae.es.