The increasing adoption of artificial intelligence (AI) systems in critical areas such as public administration, financial services or healthcare has brought the need for algorithmic transparency to the forefront. The complexity of the AI models used to make decisions such as granting a loan or reaching a medical diagnosis, especially in the case of deep learning algorithms, often gives rise to what is commonly known as the "black box" problem: the difficulty of interpreting and understanding how and why an AI model arrives at a certain conclusion. The LLMs and SLMs that we use so much lately are a clear example of black box systems, in which not even the developers themselves are able to foresee their behaviour.
In regulated sectors such as finance or healthcare, AI-based decisions can significantly affect people's lives, so doubts about possible bias or about the attribution of responsibility are not acceptable. As a result, governments have begun to develop regulatory frameworks, such as the Artificial Intelligence Regulation, that require greater explainability and oversight in the use of these systems, with the additional aim of building confidence in the advances of the digital economy.
Explainable artificial intelligence (XAI) is the discipline that has emerged in response to this challenge, proposing methods to make the decisions of AI models understandable. As in other areas related to artificial intelligence, such as LLM training, open data is an important ally of explainable artificial intelligence in building audit and verification mechanisms for algorithms and their decisions.
What is explainable AI (XAI)?
Explainable AI refers to the methods and tools that allow humans to understand and trust the results of machine learning models. According to the U.S. National Institute of Standards and Technology (NIST), the national standards body of the United States, there are four key principles of explainable artificial intelligence, intended to ensure that AI systems are transparent, understandable and trusted by users:
- Explainability: the AI must provide clear and understandable explanations of how it arrives at its decisions and recommendations.
- Meaningful: explanations must be meaningful and understandable to users.
- Accuracy: the AI must generate accurate and reliable results, and the explanation of these results must accurately reflect its performance.
- Knowledge limits: the AI must recognise when it does not have sufficient information or confidence in a decision and refrain from issuing responses in such cases.
Unlike traditional "black box" AI systems, which generate results without revealing their internal logic, XAI works on the traceability, interpretability and accountability of these decisions. For example, if a neural network rejects a loan application, XAI techniques can highlight the specific factors that influenced the decision. Thus, while a traditional model would simply return a numerical rating of the credit file, an XAI system could also tell us something like "Payment history (23%), job stability (38%) and current level of indebtedness (32%) were the determining factors in the loan denial". This transparency is vital not only for regulatory compliance, but also for building user confidence and improving AI systems themselves.
Key techniques in XAI
The Catalogue of trusted AI tools and metrics from the OECD's Artificial Intelligence Policy Observatory (OECD.AI) collects and shares tools and metrics designed to help AI actors develop trusted systems that respect human rights and are fair, transparent, explainable, robust, safe and reliable. For example, two widely adopted methodologies in XAI are Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP).
- LIME approximates complex models with simpler, interpretable versions to explain individual predictions. It is generally useful for quick interpretations, but the importance it assigns to variables can be unstable from one example to another.
- SHAP quantifies the exact contribution of each input to a prediction using game theory principles. This is a more precise and mathematically sound technique, but much more computationally expensive.
For example, in a medical diagnostic system, both LIME and SHAP could help us interpret that a patient's age and blood pressure were the main factors that led to a diagnosis of high risk of infarction, although SHAP would give us the exact contribution of each variable to the decision.
One of the most important challenges in XAI is finding the balance between a model's predictive power and its explainability. Hybrid approaches are therefore often used, combining complex models with post-hoc methods that explain their decisions. For example, a bank could implement a deep learning system for fraud detection, but use SHAP values to audit its decisions and ensure that no discriminatory decisions are made.
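As an illustration of this kind of audit, the following minimal sketch trains a simple scoring model on synthetic data and uses the shap library to estimate how much each variable weighs in its decisions. The feature names, the synthetic dataset and the model choice are assumptions made for the example, not a description of any real system.

```python
# Minimal sketch: auditing a credit-style model with SHAP.
# Assumes scikit-learn and shap are installed; data and features are synthetic.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "payment_history": rng.uniform(0, 1, 500),
    "job_stability": rng.uniform(0, 1, 500),
    "current_debt": rng.uniform(0, 1, 500),
})
# Hypothetical target: the loan is denied when debt outweighs payment history
y = ((X["current_debt"] - X["payment_history"]) > 0.2).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # one value per sample and feature

# Average absolute contribution of each feature, expressed as a percentage
importance = np.abs(shap_values).mean(axis=0)
share = 100 * importance / importance.sum()
for name, pct in zip(X.columns, share):
    print(f"{name}: {pct:.1f}% of the average explanation weight")
```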
Open data in the XAI
There are at least two scenarios in which value can be generated by combining open data with explainable artificial intelligence techniques:
- The first of these is the enrichment and validation of the explanations obtained with XAI techniques. Open data makes it possible to add layers of context to many technical explanations, which is also true for the explainability of AI models. For example, if an XAI system indicates that air pollution influenced an asthma diagnosis, linking this result to open air quality datasets from patients' areas of residence would allow validation of the correctness of the result.
- Improving the performance of AI models themselves is another area where open data brings value. For example, if an XAI system identifies that the density of urban green space significantly affects cardiovascular risk diagnoses, open urban planning data could be used to improve the accuracy of the algorithm.
It would be ideal if the datasets used to train AI models could be shared as open data, making it possible to verify how models were trained and to replicate their results. What is possible today, however, is the open sharing of detailed metadata about that training, as promoted by Google's Model Cards initiative, which facilitates post-hoc explanations of the models' decisions. In this case it is a tool aimed more at developers than at the end users of the algorithms.
In Spain, in a more citizen-driven initiative, but equally aimed at fostering transparency in the use of artificial intelligence algorithms, the Open Administration of Catalonia has started to publish comprehensible factsheets for each AI algorithm applied to digital administration services. Some are already available, such as the AOC Conversational Chatbots or the Video ID for Mobile idCat.
Real examples of open data and XAI
A recent paper published in Applied Sciences by Portuguese researchers exemplifies the synergy between XAI and open data in the field of real estate price prediction in smart cities. The research highlights how the integration of open datasets covering property characteristics, urban infrastructure and transport networks, with explainable artificial intelligence techniques such as SHAP analysis, unravels the key factors influencing property values. This approach aims to support the generation of urban planning policies that respond to the evolving needs and trends of the real estate market, promoting sustainable and equitable growth of cities.
Another study by researchers at INRIA (French Institute for Research in Digital Sciences and Technologies), also on real estate data, delves into the methods and challenges associated with interpretability in machine learning based on linked open data. The article discusses both intrinsic techniques, which integrate explainability into model design, and post hoc methods that examine and explain the decisions of complex systems, in order to encourage the adoption of transparent, ethical and trustworthy AI systems.
As AI continues to evolve, ethical considerations and regulatory measures play an increasingly important role in creating a more transparent and trustworthy AI ecosystem. Explainable artificial intelligence and open data are interconnected in their aim to foster transparency, trust and accountability in AI-based decision-making. While XAI provides the tools to dissect AI decision-making, open data provides the raw material not only for training, but also for testing some XAI explanations and improving model performance. As AI continues to permeate every facet of our lives, fostering this synergy will contribute to building systems that are not only smarter, but also fairer.
Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization. The contents and views reflected in this publication are the sole responsibility of the author.
The artificial intelligence (AI) language models trained in Spanish, Catalan, Galician, Valencian and Basque, developed within ALIA, the public infrastructure of AI resources, have been available since last week. Through the ALIA Kit, users can access the entire family of models and learn about the methodology used, the related documentation and the training and evaluation datasets. In this article we tell you about its key features.
What is ALIA?
ALIA is a project coordinated by the Barcelona Supercomputing Center-Centro Nacional de Supercomputación (BSC-CNS). It aims to provide a public infrastructure of open and transparent artificial intelligence resources, capable of generating value in both the public and private sectors.
Specifically, ALIA is a family of text, speech and machine translation models. The training of artificial intelligence systems is computationally intensive, as huge volumes of data need to be processed and analysed. These models have been trained in Spanish, a language spoken by more than 600 million people worldwide, but also in the four co-official languages. The Real Academia Española (RAE) and the Asociación de Academias de la Lengua Española, which brings together the Spanish language institutions around the world, have collaborated in this project.
The MareNostrum 5, one of the most powerful supercomputers in the world, which is located at the Barcelona Supercomputing Center, has been used for the training. It has taken thousands of hours of work to process several billion words at a speed of 314,000 trillion calculations per second.
A family of open and transparent models
The development of these models provides an alternative that incorporates local data. One of ALIA's priorities is to be an open and transparent network, which means that users, in addition to being able to access the models, have the possibility of knowing and downloading the datasets used and all related documentation. This documentation makes it easier to understand how the models work and also to detect more easily where they fail, which is essential to avoid biases and erroneous results. Openness of models and transparency of data is essential, as it creates more inclusive and socially just models, which benefit society as a whole.
Having open and transparent models encourages innovation, research and democratises access to artificial intelligence, while ensuring that it is based on quality training data.
What can I find in ALIA Kit?
Through the ALIA Kit, it is currently possible to access five general-purpose large language models (LLMs), two of which have been trained with instructions from various open corpora. Also available are nine multilingual machine translation models, some of them trained from scratch, such as one for machine translation between Galician and Catalan, or between Basque and Catalan. In addition, translation models have been trained for Aranese, Aragonese and Asturian.
We also find the data and tools used to build and evaluate the text models, such as the massive CATalog textual corpus, consisting of 17.45 billion words (about 23 billion tokens), distributed over 34.8 million documents from a wide variety of sources, which have been largely manually reviewed.
To train the speech models, different speech corpora with transcription have been used, such as, for example, a dataset of the Valencian Parliament with more than 270 hours of recordings of its sessions. It is also possible to know the corpora used to train the machine translation models.
A free API, accessible from Python, JavaScript or cURL, is also available through the ALIA Kit, with which tests can be carried out.
What can these models be used for?
The models developed by ALIA are designed to be adaptable to a wide range of natural language processing tasks. However, for specific needs it is preferable to use specialised models, which are more accurate and less resource-intensive.
As we have seen, the models are available to all interested users, such as independent developers, researchers, companies, universities or institutions. Among the main beneficiaries of these tools are developers and small and medium-sized enterprises, for whom it is not feasible to develop their own models from scratch, both for economic and technical reasons. Thanks to ALIA they can adapt existing models to their specific needs.
Developers will find resources to create applications that reflect the linguistic richness of Spanish and the co-official languages. For their part, companies will be able to develop new applications, products or services aimed at the broad international market offered by the Spanish language, opening up new business and expansion opportunities.
An innovative project financed with public funds
The ALIA project is fully publicly funded with the aim of fostering innovation and the adoption of value-generating technologies in both the public and private sectors. Having a public AI infrastructure democratises access to advanced technologies, allowing small businesses, institutions and governments to harness their full potential to innovate and improve their services. It also facilitates ethical oversight of AI development and encourages innovation.
ALIA is part of Spain's Artificial Intelligence Strategy 2024, which aims to provide the country with the necessary capabilities to meet the growing demand for AI products and services and to boost the adoption of this technology, especially in the public sector and SMEs. Within Axis 1 of this strategy is the so-called Lever 3, which focuses on the generation of models and corpora for a public infrastructure of language models. The publication of this family of models advances the development of artificial intelligence resources in Spain.
Language models are at the epicentre of the technological paradigm shift that has been taking place in generative artificial intelligence (AI) over the last two years. From the tools with which we interact in natural language to generate text, images or videos, which we use to create creative content, design prototypes or produce educational material, to more complex applications in research and development that were even instrumental in winning the 2024 Nobel Prize in Chemistry, language models are proving their usefulness in a wide variety of applications that we are still exploring.
Since Google's influential 2017 paper "Attention is all you need", which describes the Transformer architecture underpinning the new capabilities that OpenAI popularised in late 2022 with the launch of ChatGPT, the evolution of language models has been more than dizzying. In just two years, we have moved from models focused solely on text generation to multimodal versions that integrate interaction and generation of text, images and audio.
This rapid evolution has given rise to two categories of language models: SLMs (Small Language Models), which are lighter and more efficient, and LLMs (Large Language Models), which are heavier and more powerful. Far from considering them competitors, we should see SLMs and LLMs as complementary technologies. While LLMs offer general processing and content generation capabilities, SLMs can provide support for more agile and specialised solutions for specific needs. However, both share one essential element: they rely on large volumes of data for training, and at the heart of their capabilities is open data, which is part of the fuel used to train the language models on which generative AI applications are based.
LLM: power driven by massive data
LLMs are large-scale language models with billions, even trillions, of parameters. These parameters are the mathematical units that allow the model to identify and learn patterns in the training data, giving it an extraordinary ability to generate text (or other formats) that is coherent and adapted to the users' context. These models, such as the GPT family from OpenAI, Gemini from Google or Llama from Meta, are trained on immense volumes of data and are capable of performing complex tasks, including some for which they were not explicitly trained.
Thus, LLMs are able to perform tasks such as generating original content, answering questions with relevant and well-structured information or generating software code, all with a level of competence equal to or higher than humans specialised in these tasks and always maintaining complex and fluent conversations.
LLMs rely on massive amounts of data to achieve their current level of performance: from repositories such as Common Crawl, which collects data from millions of web pages, to structured sources such as Wikipedia or specialised sets such as PubMed Open Access in the biomedical field. Without access to these massive bodies of open data, the ability of these models to generalise and adapt to multiple tasks would be much more limited.
However, as LLMs continue to evolve, the need for open data increases to achieve specific advances such as:
- Increased linguistic and cultural diversity: although today's LLMs are multilingual, they are generally dominated by data in English and other major languages. The lack of open data in other languages limits the ability of these models to be truly inclusive and diverse. More open data in diverse languages would ensure that LLMs can be useful to all communities, while preserving the world's cultural and linguistic richness.
- Bias reduction: LLMs, like any AI model, are prone to reflecting the biases present in the data they are trained on. This sometimes leads to responses that perpetuate stereotypes or inequalities. Incorporating more carefully selected open data, especially from sources that promote diversity and equality, is fundamental to building models that fairly and equitably represent different social groups.
- Constant updating: data on the web and other open resources are constantly changing. Without access to up-to-date data, LLMs quickly generate outdated responses. Increasing the availability of fresh and relevant open data would therefore allow LLMs to keep pace with current events [9].
- More accessible training: as LLMs grow in size and capability, so does the cost of training and fine-tuning them. Open data allows independent developers, universities and small businesses to train and refine their own models without the need for costly data acquisitions. This democratises access to artificial intelligence and fosters global innovation.
To address some of these challenges, the new Artificial Intelligence Strategy 2024 includes measures aimed at generating models and corpora in Spanish and co-official languages, including the development of evaluation datasets that consider ethical evaluation.
SLM: optimised efficiency with specific data
On the other hand, SLMs have emerged as an efficient and specialised alternative: they use a smaller number of parameters (usually in the millions) and are designed to be lightweight and fast. Although they do not match the versatility and competence of LLMs on complex tasks, SLMs stand out for their computational efficiency, speed of deployment and ability to specialise in specific domains.
For this, SLMs also rely on open data, but in this case, the quality and relevance of the datasets are more important than their volume, so the challenges they face are more related to data cleaning and specialisation. These models require sets that are carefully selected and tailored to the specific domain for which they are to be used, as any errors, biases or unrepresentativeness in the data can have a much greater impact on their performance. Moreover, due to their focus on specialised tasks, the SLMs face additional challenges related to the accessibility of open data in specific fields. For example, in sectors such as medicine, engineering or law, relevant open data is often protected by legal and/or ethical restrictions, making it difficult to use it to train language models.
SLMs are trained with carefully selected data aligned to the domain in which they will be used, allowing them to outperform LLMs in accuracy and specificity on specific tasks, such as:
- Text autocompletion: an SLM for Spanish autocompletion can be trained with a selection of books, educational texts or corpora such as those to be promoted in the aforementioned AI Strategy, being much more efficient than a general-purpose LLM for this task.
- Legal consultations: an SLM trained with open legal datasets can provide accurate and contextualised answers to legal questions or process contractual documents more efficiently than an LLM.
- Customised education: in the education sector, an SLM trained with open educational resources can generate specific explanations, personalised exercises or even automatic assessments, adapted to the level and needs of the student.
- Medical diagnosis: An SLM trained with medical datasets, such as clinical summaries or open publications, can assist physicians in tasks such as identifying preliminary diagnoses, interpreting medical images through textual descriptions or analysing clinical studies.
Ethical Challenges and Considerations
We should not forget that, despite the benefits, the use of open data in language modelling presents significant challenges. One of the main challenges is, as we have already mentioned, to ensure the quality and neutrality of the data so that they are free of biases, as these can be amplified in the models, perpetuating inequalities or prejudices.
Even if a dataset is technically open, its use in artificial intelligence models always raises some ethical implications. For example, it is necessary to avoid that personal or sensitive information is leaked or can be deduced from the results generated by the models, as this could cause damage to the privacy of individuals.
The issue of data attribution and intellectual property must also be taken into account. The use of open data in business models must address how the original creators of the data are recognised and adequately compensated so that incentives for creators continue to exist.
Open data is the engine that drives the amazing capabilities of language models, both SLMs and LLMs. While SLMs stand out for their efficiency and accessibility, LLMs open doors to advanced applications that not long ago seemed impossible. However, the path towards developing more capable, but also more sustainable and representative, models depends to a large extent on how we manage and exploit open data.
Content prepared by Jose Luis Marín, Senior Consultant in Data, Strategy, Innovation & Digitalization. The contents and views reflected in this publication are the sole responsibility of the author.
Natural language processing (NLP) is a branch of artificial intelligence that allows machines to understand and manipulate human language. At the core of many modern applications, such as virtual assistants, machine translation and chatbots, are word embeddings. But what exactly are they and why are they so important?
What are word embeddings?
Word embeddings are a technique that allows machines to represent the meaning of words in such a way that complex relationships between them can be captured. To understand this, let's think about how words are used in context: a word acquires meaning depending on the words surrounding it. For example, the word bank can refer to a financial institution or to the bank of a river, depending on the context in which it appears.
Word embeddings place each word at a point in a multi-dimensional vector space, so that words used in similar contexts end up close to one another. To visualise this, imagine that words like lake, river and ocean would be close together in this space, while words like lake and building would be much further apart. This structure enables language processing algorithms to perform complex tasks, such as finding synonyms, making accurate translations or even answering context-based questions.
How are word embeddings created?
The main objective of word embeddings is to capture semantic relationships and contextual information of words, transforming them into numerical representations that can be understood by machine learning algorithms. Instead of working with raw text, machines require words to be converted into numbers in order to identify patterns and relationships effectively.
The process of creating word embeddings consists of training a model on a large corpus of text, such as Wikipedia articles or news items, so that it learns the structure of the language. The first step involves pre-processing the corpus, which includes tokenising the words, removing punctuation and irrelevant terms and, in some cases, converting the entire text to lower case to maintain consistency.
The use of context to capture meaning
Once the text has been pre-processed, a technique known as a "sliding context window" is used to extract information. This means that, for each target word, the surrounding words within a certain range are taken into account. For example, with a context window of three words, for the target word plane in the sentence "The plane takes off at six o'clock", the context words would be The, takes, off and at.
The model is trained to learn to predict a target word using the words in its context (or conversely, to predict the context from the target word). To do this, the algorithm adjusts its parameters so that the vectors assigned to each word are closer in vector space if those words appear frequently in similar contexts.
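The sliding-window idea is easy to reproduce in a few lines of code. The sketch below, with an invented sentence and window size, simply enumerates the (target, context) pairs that a Word2Vec-style model would then learn to predict.

```python
# Minimal sketch of the sliding context window described above.
# The sentence and the window size are illustrative choices.
sentence = "The plane takes off at six o'clock"
tokens = sentence.lower().split()      # simple pre-processing: lowercase + tokenise
window = 2                             # number of words taken on each side of the target

pairs = []
for i, target in enumerate(tokens):
    start, end = max(0, i - window), min(len(tokens), i + window + 1)
    context = [tokens[j] for j in range(start, end) if j != i]
    pairs.append((target, context))

for target, context in pairs:
    print(f"target: {target:<8} context: {context}")
# A CBOW model learns to predict the target from its context;
# a skip-gram model does the opposite.
```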
How models learn language structure
The creation of word embeddings is based on the ability of these models to identify patterns and semantic relationships. During training, the model adjusts the values of the vectors so that words that often share contexts have similar representations. For example, if airplane and helicopter are frequently used in similar phrases (e.g. in the context of air transport), the vectors of airplane and helicopter will be close together in vector space.
As the model processes more and more examples of sentences, it refines the positions of the vectors in the continuous space. Thus, the vectors reflect not only semantic proximity, but also other relationships such as synonyms, categories (e.g., fruits, animals) and hierarchical relationships (e.g., dog and animal).
A simplified example
Imagine a small corpus of only six words: guitar, bass, drums, piano, car and bicycle. Suppose that each word is represented in a three-dimensional vector space as follows:
guitar [0.3, 0.8, -0.1]
bass [0.4, 0.7, -0.2]
drums [0.2, 0.9, -0.1]
piano [0.1, 0.6, -0.3]
car [0.8, -0.1, 0.6]
bicycle [0.7, -0.2, 0.5]
In this simplified example, the words guitar, bass, drums and piano represent musical instruments and are located close to each other in vector space, as they are used in similar contexts. In contrast, car and bicycle, which belong to the category of means of transport, are distant from the musical instruments but close to each other. This other image shows how different terms related to sky, wings and engineering would look in a vector space.
Figure 1. Examples of representation of a corpus in a vector space. Source: Adapted from “Word embeddings: the (very) basics”, by Guillaume Desagulier.
This example only uses three dimensions to illustrate the idea, but in practice, word embeddings usually have between 100 and 300 dimensions to capture more complex semantic relationships and linguistic nuances.
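With the toy vectors above, the closeness between words can be measured with cosine similarity. The following snippet (using NumPy, with the example values from this section) shows that the instrument vectors score close to 1 with each other and much lower against the vehicle vectors.

```python
# Cosine similarity between the toy three-dimensional vectors above.
import numpy as np

vectors = {
    "guitar":  np.array([0.3, 0.8, -0.1]),
    "piano":   np.array([0.1, 0.6, -0.3]),
    "car":     np.array([0.8, -0.1, 0.6]),
    "bicycle": np.array([0.7, -0.2, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["guitar"], vectors["piano"]))   # ~0.93: both instruments
print(cosine(vectors["car"], vectors["bicycle"]))    # ~0.99: both vehicles
print(cosine(vectors["guitar"], vectors["car"]))     # ~0.12: unrelated contexts
```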
The result is a set of vectors that efficiently represent each word, allowing language processing models to identify patterns and semantic relationships more accurately. With these vectors, machines can perform advanced tasks such as semantic search, text classification and question answering, significantly improving natural language understanding.
Strategies for generating word embeddings
Over the years, multiple approaches and techniques have been developed to generate word embeddings. Each strategy has its own way of capturing the meaning and semantic relationships of words, resulting in different characteristics and uses. Some of the main strategies are presented below:
1. Word2Vec: local context capture
Developed by Google, Word2Vec is one of the most popular approaches and is based on the idea that the meaning of a word is defined by its context. It uses two main approaches:
- CBOW (Continuous Bag of Words): In this approach, the model predicts the target word using the words in its immediate environment. For example, given a context such as "The dog is ___ in the garden", the model attempts to predict the word playing, based on the words The, dog, is and garden.
- Skip-gram: Conversely, Skip-gram uses a target word to predict the surrounding words. Using the same example, if the target word is playing, the model would try to predict that the words in its environment are The, dog, is and garden.
The key idea is that Word2Vec trains the model to capture semantic proximity across many iterations on a large corpus of text. Words that tend to appear together have closer vectors, while unrelated words appear further apart.
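As a hands-on illustration of these two approaches, the sketch below trains a tiny Word2Vec model with the gensim library (assuming gensim 4.x); the three-sentence corpus is invented and far too small for meaningful results, but it shows where the CBOW/skip-gram choice and the window size appear in practice.

```python
# Training a toy Word2Vec model with gensim (assumes gensim 4.x is installed).
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "is", "playing", "in", "the", "garden"],
    ["the", "cat", "is", "sleeping", "in", "the", "garden"],
    ["the", "dog", "and", "the", "cat", "play", "together"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensions of each word vector
    window=3,         # size of the sliding context window
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep every word in this tiny corpus
    epochs=50,
)

print(model.wv["dog"][:5])             # first components of the vector for "dog"
print(model.wv.most_similar("dog"))    # nearest words in the learned space
```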
2. GloVe: global statistics-based approach
GloVe, developed at Stanford University, differs from Word2Vec by using global co-occurrence statistics of words in a corpus. Instead of considering only the immediate context, GloVe is based on the frequency with which two words appear together in the whole corpus.
For example, if bread and butter appear together frequently, but bread and planet are rarely found in the same context, the model adjusts the vectors so that bread and butter are close together in vector space.
This allows GloVe to capture broader global relationships between words and to make the representations more robust at the semantic level. Models trained with GloVe tend to perform well on analogy and word similarity tasks.
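Pre-trained GloVe vectors can also be explored without training anything, for example through gensim's dataset downloader. The snippet below assumes internet access and that the "glove-wiki-gigaword-100" package is available in the gensim-data repository.

```python
# Exploring pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

print(glove.most_similar("bread", topn=5))    # neighbours such as "butter" or "flour"
print(glove.similarity("bread", "butter"))    # frequent co-occurrence -> high score
print(glove.similarity("bread", "planet"))    # rare co-occurrence -> low score
```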
3. FastText: subword capture
FastText, developed by Facebook, improves on Word2Vec by introducing the idea of breaking down words into sub-words. Instead of treating each word as an indivisible unit, FastText represents each word as a sum of character n-grams. For example, the word playing could be broken down into play, ayi, ing, and so on.
This allows FastText to capture similarities even between words that did not appear explicitly in the training corpus, such as morphological variations (playing, play, player). This is particularly useful for languages with many grammatical variations.
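The sub-word decomposition itself needs no special library; the short function below, written for illustration, lists the character n-grams of a word using FastText-style boundary markers.

```python
# Illustration of the sub-word idea behind FastText: character n-grams.
def char_ngrams(word, n_min=3, n_max=4):
    padded = f"<{word}>"   # boundary markers, as FastText uses
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_ngrams("playing"))
# Shared n-grams such as 'pla', 'lay' and 'play' link 'playing', 'play' and
# 'player', so even word forms unseen during training get a useful vector.
```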
4. Contextual embeddings: dynamic sense-making
Models such as BERT and ELMo represent a significant advance in word embeddings. Unlike the previous strategies, which generate a single vector for each word regardless of the context, contextual embeddings generate different vectors for the same word depending on its use in the sentence.
For example, the word bank will have a different vector in the sentence "I sat on the park bench" than in "the bank approved my credit application". This variability is achieved by training the model on large text corpora in a bidirectional manner, i.e. considering not only the words preceding the target word, but also those following it.
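To see this contextual behaviour, one can extract the hidden-state vector of a word from a pre-trained BERT model in two different sentences and compare them. The sketch below assumes the transformers and torch packages and the standard bert-base-uncased checkpoint; the two sentences are chosen so that the English word "bank" carries different senses.

```python
# The same word receives different contextual vectors depending on the sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                          # vector of that token

v_river = word_vector("I sat on the bank of the river.", "bank")
v_money = word_vector("The bank approved my credit application.", "bank")

print(float(torch.cosine_similarity(v_river, v_money, dim=0)))  # noticeably below 1
```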
Practical applications of word embeddings
Word embeddings are used in a variety of natural language processing applications, including:
- Named Entity Recognition (NER): allows you to identify and classify names of people, organisations and places in a text. For example, in the sentence "Apple announced its new headquarters in Cupertino", the word embeddings allow the model to understand that Apple is an organisation and Cupertino is a place.
- Automatic translation: helps to represent words in a language-independent way. By training a model with texts in different languages, representations can be generated that capture the underlying meaning of words, facilitating the translation of complete sentences with a higher level of semantic accuracy.
- Information retrieval systems: in search engines and recommender systems, word embeddings improve the match between user queries and relevant documents. By capturing semantic similarities, they allow even non-exact queries to be matched with useful results. For example, if a user searches for "medicine for headache", the system can suggest results related to analgesics thanks to the similarities captured in the vectors.
- Q&A systems: word embeddings are essential in systems such as chatbots and virtual assistants, where they help to understand the intent behind questions and find relevant answers. For example, for the question "What is the capital of Italy?", the word embeddings allow the system to understand the relationship between capital and Italy and find Rome as an answer.
- Sentiment analysis: word embeddings are used in models that determine whether the sentiment expressed in a text is positive, negative or neutral. By analysing the relationships between words in different contexts, the model can identify patterns of use that indicate certain feelings, such as joy, sadness or anger.
- Semantic clustering and similarity detection: word embeddings also allow you to measure the semantic similarity between documents, phrases or words. This is used for tasks such as grouping related items, recommending products based on text descriptions or even detecting duplicates and similar content in large databases.
Conclusion
Word embeddings have transformed the field of natural language processing by providing dense and meaningful representations of words, capable of capturing their semantic and contextual relationships. With the emergence of contextual embeddings, the potential of these representations continues to grow, allowing machines to understand even the subtleties and ambiguities of human language. From applications in translation and search systems, to chatbots and sentiment analysis, word embeddings will continue to be a fundamental tool for the development of increasingly advanced and humanised natural language technologies.
Content prepared by Juan Benavente, senior industrial engineer and expert in technologies linked to the data economy. The contents and points of view reflected in this publication are the sole responsibility of the author.
In the fast-paced world of Generative Artificial Intelligence (AI), there are several concepts that have become fundamental to understanding and harnessing the potential of this technology. Today we focus on four: Small Language Models (SLM), Large Language Models (LLM), Retrieval Augmented Generation (RAG) and Fine-tuning. In this article, we will explore each of these terms, their interrelationships and how they are shaping the future of generative AI.
Let us start at the beginning. Definitions
Before diving into the details, it is important to understand briefly what each of these terms stands for:
The first two concepts (SLM and LLM) that we address are what are known as language models. A language model is an artificial intelligence system that understands and generates text in human language, as do chatbots or virtual assistants. The following two concepts (Fine Tuning and RAG) could be defined as optimisation techniques for these previous language models. Ultimately, these techniques, with their respective approaches as discussed below, improve the answers and the content returned to the questioner. Let's go into the details:
- SLM (Small Language Models): More compact and specialised language models, designed for specific tasks or domains.
- LLM (Large Language Models): Large-scale language models, trained on vast amounts of data and capable of performing a wide range of linguistic tasks.
- RAG (Retrieval-Augmented Generation): A technique that combines the retrieval of relevant information with text generation to produce more accurate and contextualised responses.
- Fine-tuning: The process of tuning a pre-trained model for a specific task or domain, improving its performance in specific applications.
Now, let's dig deeper into each concept and explore how they interrelate in the Generative AI ecosystem.
Figure 1. Pillars of Generative AI. Own elaboration.
SLM: The power of specialisation
Increased efficiency for specific tasks
Small Language Models (SLMs) are AI models designed to be lighter and more efficient than their larger counterparts. Although they have fewer parameters, they are optimised for specific tasks or domains.
Key characteristics of SLMs:
- Computational efficiency: They require fewer resources for training and implementation.
- Specialisation: They focus on specific tasks or domains, achieving high performance in specific areas.
- Rapid implementation: Ideal for resource-constrained devices or applications requiring real-time responses.
- Lower carbon footprint: Being smaller, their training and use consumes less energy.
SLM applications:
- Virtual assistants for specific tasks (e.g. booking appointments).
- Personalised recommendation systems.
- Sentiment analysis in social networks.
- Machine translation for specific language pairs.
LLM: The power of generalisation
The revolution of Large Language Models
LLMs have transformed the Generative AI landscape, offering amazing capabilities in a wide range of language tasks.
Key characteristics of LLMs:
- Vast amounts of training data: They train with huge corpuses of text, covering a variety of subjects and styles.
- Complex architectures: They use advanced architectures, such as Transformers, with billions of parameters.
- Generalisability: They can tackle a wide variety of tasks without the need for task-specific training.
- Contextual understanding: They are able to understand and generate text considering complex contexts.
LLM applications:
- Generation of creative text (stories, poetry, scripts).
- Answers to complex questions and reasoning.
- Analysis and summary of long documents.
- Advanced multilingual translation.
RAG: Boosting accuracy and relevance
The synergy between recovery and generation
As we explored in our previous article, RAG combines the power of information retrieval models with the generative capacity of LLMs. Its key aspects are:
Key features of RAG:
- Increased accuracy of responses.
- Capacity to provide up-to-date information.
- Reduction of "hallucinations" or misinformation.
- Adaptability to specific domains without the need to completely retrain the model.
RAG applications:
- Advanced customer service systems.
- Academic research assistants.
- Fact-checking tools for journalism.
- AI-assisted medical diagnostic systems.
Fine-tuning: Adaptation and specialisation
Refining models for specific tasks
Fine-tuning is the process of adjusting a pre-trained model (usually an LLM) to improve its performance in a specific task or domain. Its main elements are as follows:
Key features of fine-tuning:
- Significant improvement in performance on specific tasks.
- Adaptation to specialised or niche domains.
- Reduced time and resources required compared to training from scratch.
- Possibility of incorporating specific knowledge of the organisation or industry.
Fine-tuning applications:
- Industry-specific language models (legal, medical, financial).
- Personalised virtual assistants for companies.
- Content generation systems tailored to particular styles or brands.
- Specialised data analysis tools.
Here are a few examples
Many of you who follow the latest news in generative AI will recognise the examples below.
SLM: The power of specialisation
Example: BERT for sentiment analysis
BERT (Bidirectional Encoder Representations from Transformers) is an example of SLM when used for specific tasks. Although BERT itself is a large language model, smaller, specialised versions of BERT have been developed for sentiment analysis in social networks.
For example, DistilBERT, a scaled-down version of BERT, has been used to create sentiment analysis models on X (Twitter). These models can quickly classify tweets as positive, negative or neutral, being much more efficient in terms of computational resources than larger models.
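With the Hugging Face transformers library, this kind of specialised classifier can be used in a few lines. The sketch below relies on the pipeline API and the publicly available distilbert-base-uncased-finetuned-sst-2-english checkpoint; the example texts are invented.

```python
# Sentiment classification with a distilled, task-specific BERT model.
# Assumes the transformers package and internet access to download the checkpoint.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

posts = [
    "I love how fast the new update is!",
    "Customer support never answered my messages.",
]
for post, result in zip(posts, classifier(posts)):
    print(post, "->", result["label"], round(result["score"], 3))
```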
LLM: The power of generalisation
Example: OpenAI's GPT-3
GPT-3 (Generative Pre-trained Transformer 3) is one of the best known and most widely used LLMs. With 175 billion parameters, GPT-3 is capable of performing a wide variety of natural language processing tasks without the need for task-specific training.
A well-known practical application of GPT-3 is ChatGPT, OpenAI's conversational chatbot. ChatGPT can hold conversations on a wide variety of topics, answer questions, help with writing and programming tasks, and even generate creative content, all using the same basic model.
Back at the end of 2020, we published our first post on GPT-3 as a large language model. For the more nostalgic, you can check the original post here.
RAG: Boosting accuracy and relevance
Example: Claude, Anthropic's virtual assistant
Claude, the virtual assistant developed by Anthropic, is an example of an application using RAG techniques. Although the exact details of its implementation are not public, Claude is known for its ability to provide accurate and up-to-date answers, even on recent events.
In fact, most generative AI-based conversational assistants incorporate RAG techniques to improve the accuracy and context of their responses. Thus, ChatGPT, the aforementioned Claude, MS Bing and the like use RAG.
Fine-tuning: Adaptation and specialisation
Example: GPT-3 fine-tuned for GitHub Copilot
GitHub Copilot, the programming assistant from GitHub and OpenAI, is an excellent example of fine-tuning applied to an LLM. Copilot is based on a GPT model (possibly a variant of GPT-3) that has been specifically fine-tuned for programming tasks.
The base model was further trained with a large amount of source code from public GitHub repositories, allowing it to generate relevant and syntactically correct code suggestions in a variety of programming languages. This is a clear example of how fine-tuning can adapt a general purpose model to a highly specialised task.
Another example: in the datos.gob.es blog, we also wrote a post about applications that used GPT-3 as a base LLM to build specific customised products.
Interrelationships and synergies
These four concepts do not operate in isolation, but intertwine and complement each other in the Generative AI ecosystem:
- SLM vs LLM: While LLMs offer versatility and generalisability, SLMs provide efficiency and specialisation. The choice between one or the other will depend on the specific needs of the project and the resources available.
- RAG and LLM: RAG empowers LLMs by providing them with access to up-to-date and relevant information. This improves the accuracy and usefulness of the answers generated.
- Fine-tuning and LLM: Fine-tuning allows generic LLMs to be adapted to specific tasks or domains, combining the power of large models with the specialisation needed for certain applications.
- RAG and Fine-tuning: These techniques can be combined to create highly specialised and accurate systems. For example, an LLM fine-tuned for a specific domain can be used as the generative component in a RAG system.
- SLM and Fine-tuning: Fine-tuning can also be applied to SLMs to further improve their performance on specific tasks, creating highly efficient and specialised models.
Conclusions and the future of AI
The combination of these four pillars is opening up new possibilities in the field of Generative AI:
- Hybrid systems: Combination of SLM and LLM for different aspects of the same application, optimising performance and efficiency.
- Advanced RAG: Implementation of more sophisticated RAG systems using multiple information sources and more advanced retrieval techniques.
- Continuous fine-tuning: Development of techniques for the continuous adjustment of models in real time, adapting to new data and needs.
- Personalisation to scale: Creation of highly customised models for individuals or small groups, combining fine-tuning and RAG.
- Ethical and responsible Generative AI: Implementation of these techniques with a focus on transparency, verifiability and reduction of bias.
SLM, LLM, RAG and Fine-tuning represent the fundamental pillars on which the future of Generative AI is being built. Each of these concepts brings unique strengths:
- SLMs offer efficiency and specialisation.
- LLMs provide versatility and generalisability.
- RAG improves the accuracy and relevance of responses.
- Fine-tuning allows the adaptation and customisation of models.
The real magic happens when these elements combine in innovative ways, creating Generative AI systems that are more powerful, accurate and adaptive than ever before. As these technologies continue to evolve, we can expect to see increasingly sophisticated and useful applications in a wide range of fields, from healthcare to creative content creation.
The challenge for developers and researchers will be to find the optimal balance between these elements, considering factors such as computational efficiency, accuracy, adaptability and ethics. The future of Generative AI promises to be fascinating, and these four concepts will undoubtedly be at the heart of its development and application in the years to come.
Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation. The contents and points of view reflected in this publication are the sole responsibility of its author.
In recent months we have seen how the large language models (LLMs) that enable Generative Artificial Intelligence (GenAI) applications have been improving in terms of accuracy and reliability. RAG (Retrieval Augmented Generation) techniques have allowed us to use the full power of natural language processing (NLP) to communicate with machines, exploring our own knowledge bases and extracting processed information in the form of answers to our questions. In this article we take a closer look at RAG techniques in order to learn more about how they work and all the possibilities they offer in the context of generative AI.
What are RAG techniques?
This is not the first time we have talked about RAG techniques. In this article we have already introduced the subject, explaining in a simple way what they are, what their main advantages are and what benefits they bring in the use of Generative AI.
Let us recall its main points for a moment. RAG stands for Retrieval Augmented Generation. In other words, RAG works as follows: when a user asks a question, usually through a conversational interface, the artificial intelligence (AI), before providing a direct answer (which it could give using the fixed knowledge base it was trained on), carries out a process of searching and processing information in a specific database provided in advance, complementary to that of the training. When we talk about a database, we mean a knowledge base prepared beforehand from a set of documents that the system will use to provide more accurate answers. Thus, when RAG techniques are used, conversational interfaces produce more accurate and context-specific responses.
Figure: conceptual diagram of the operation of a conversational interface or assistant without using RAG (top) and using RAG (bottom). Source: own elaboration.
Drawing a comparison with the medical field, we could say that the use of RAG is as if a doctor, with extensive experience and therefore highly trained, in addition to the knowledge acquired during his academic training and years of experience, has quick and effortless access to the latest studies, analyses and medical databases instantly, before providing a diagnosis. Academic training and years of experience are equivalent to large language model (LLM) training and the "magic" access to the latest studies and specific databases can be assimilated to what RAG techniques provide.
Evidently, in the example we have just given, good medical practice makes both elements indispensable, and the human brain knows how to combine them naturally, although not without effort and time, even with today's digital tools, which make the search for information easier and more immediate.
RAG in detail
RAG Fundamentals
RAG combines two phases to achieve its objective: retrieval and generation. In the first, relevant documents are searched for in a database containing information relevant to the question posed (e.g. a clinical database or a knowledge base of frequently asked questions and answers). In the second, an LLM is used to generate a response based on the retrieved documents. This approach ensures that responses are not only coherent but also accurate and supported by verifiable data.
Components of the RAG System
Below we describe the components that a RAG system uses. For each component, we explain its function, the technologies typically used to implement it, and an example of the part of the RAG process in which it is involved.
- Retrieval Model:
- Function: Identifies and retrieves relevant documents from a large database in response to a query.
- Technology: It generally uses Information Retrieval (IR) techniques such as BM25 or embedding-based retrieval models such as Dense Passage Retrieval (DPR).
- Process: Given a question, the retrieval model searches a database to find the most relevant documents and presents them as context for answer generation.
- Generation Model:
- Function: Generate coherent and contextually relevant answers using the retrieved documents.
- Technology: Based on some of the major Large Language Models (LLMs), such as GPT-3.5, T5, BERT or Llama.
- Process: The generation model takes the user's query and the retrieved documents and uses this combined information to produce an accurate response.
Detailed RAG Process
For a better understanding of this section, we recommend the reader to read this previous work in which we explain in a didactic way the basics of natural language processing and how we teach machines to read. In detail, a RAG algorithm performs the following steps:
- Reception of the question. The system receives a question from the user. This question is processed to extract keywords and understand the intention.
- Document retrieval. The question is sent to the retrieval model.
- Example of Retrieval based on embeddings:
- The question is converted into a vector of embeddings using a pre-trained model.
- This vector is compared with the document vectors in the database.
- The documents with the highest similarity are selected.
- Example of BM25:
- The question is tokenised and the keywords are compared with the inverted indexes in the database.
- The most relevant documents are retrieved according to a relevance score.
- Filtering and sorting. The retrieved documents are filtered to eliminate redundancies and to classify them according to their relevance. Additional techniques such as reranking can be applied using more sophisticated models.
- Response generation. The filtered documents are concatenated with the user's question and fed into the generation model. The LLM uses the combined information to generate an answer that is coherent and directly relevant to the question. For example, if we use GPT-3.5 as LLM, the input to the model includes both the user's question and fragments of the retrieved documents. Finally, the model generates text using its ability to understand the context of the information provided.
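The following minimal sketch puts these steps together for the retrieval side and the prompt construction. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 encoder; the document base and question are invented, and the final call to an LLM is deliberately left out, since any of the generation models mentioned above could be plugged in at that point.

```python
# Minimal RAG sketch: embedding-based retrieval plus prompt construction.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Products may be returned within 30 days in their original packaging.",
    "Shipping is free for orders over 50 euros.",
    "Refunds are issued to the original payment method within 5 working days.",
]
question = "How long do I have to return a product?"

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
q_vec = encoder.encode(question, normalize_embeddings=True)

# Retrieval: rank documents by cosine similarity (vectors are already normalised)
scores = doc_vecs @ q_vec
top_docs = [documents[i] for i in np.argsort(scores)[::-1][:2]]

# Generation: the retrieved context is concatenated with the question and
# passed to the LLM of choice (that call is not shown here).
prompt = (
    "Answer using only the context below.\n\n"
    "Context:\n" + "\n".join(top_docs) +
    "\n\nQuestion: " + question
)
print(prompt)
```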
In the following section we will look at some applications where Artificial Intelligence and large language models play a differentiating role and, in particular, we will analyse how these use cases benefit from the application of RAG techniques.
Examples of use cases that benefit substantially from using RAG vs. not using RAG
1. E-commerce Customer Service
- No RAG:
- A basic chatbot can give generic and potentially incorrect answers about return policies.
- Example: Please review our returns policy on the website.
- With RAG:
- The chatbot accesses the database of updated policies and provides a specific and accurate response.
- Example: You may return products within 30 days of purchase, provided they are in their original packaging. See more details [here].
2. Medical Diagnosis
- No RAG:
- A virtual health assistant could offer recommendations based only on their previous training, without access to the latest medical information.
- Example: You may have the flu. Consult your doctor.
- With RAG:
- The wizard can retrieve information from recent medical databases and provide a more accurate and up-to-date diagnosis.
- Example: Based on your symptoms and recent studies published in PubMed, you could be dealing with a viral infection. Consult your doctor for an accurate diagnosis.
3. Academic Research Assistance
- No RAG:
- A researcher receives answers limited to what the model already knows, which may not be sufficient for highly specialised topics.
- Example: Economic growth models are important for understanding the economy.
- With RAG:
- The wizard retrieves and analyses relevant academic articles, providing detailed and accurate information.
- Example: According to the 2023 study in the Journal of Economic Growth, the XYZ model has been shown to be 20% more accurate in predicting economic trends in emerging markets.
4. Journalism
- No RAG:
- A journalist receives generic information that may not be up to date or accurate.
- Example: Artificial intelligence is changing many industries.
- With RAG:
- The wizard retrieves specific data from recent studies and articles, providing a solid basis for the article.
- Example: According to a 2024 report by 'TechCrunch', AI adoption in the financial sector has increased by 35% in the last year, improving operational efficiency and reducing costs.
Of course, those of us who have tried the more accessible conversational interfaces, such as ChatGPT, Gemini or Bing, can see that the answers are usually complete and quite precise when it comes to general questions. This is because these agents make use of RAG methods and other advanced techniques to provide their answers. However, not long ago conversational assistants such as Alexa, Siri or OK Google provided extremely simple answers, very similar to those shown in the previous examples when RAG is not used.
Conclusions
Retrieval Augmented Generation (RAG) techniques improve the accuracy and relevance of language model answers by combining document retrieval and text generation. Using retrieval methods such as BM25 or DPR and advanced language models, RAG provides more contextualised, up-to-date and accurate responses. Today, RAG is the key to the exponential development of AI in the private data domain of companies and organisations. In the coming months, RAG is expected to see massive adoption in a variety of industries, optimising customer care, medical diagnostics, academic research and journalism, thanks to its ability to integrate relevant and current information in real time.
Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation. The contents and points of view reflected in this publication are the sole responsibility of its author.
The transfer of human knowledge to machine learning models is the basis of all current artificial intelligence. If we want AI models to be able to solve tasks, we first have to encode solved tasks and transmit them in a formal language that the models can process. By a solved task we mean information encoded in different formats, such as text, image, audio or video. In the case of language processing, and in order to build systems with a high level of linguistic competence that can communicate with us fluently, we need to transfer to them as many human text productions as possible. These datasets are what we call a corpus.
Corpus: text datasets
When we talk about corpora (its Latin plural) or datasets that have been used to train Large Language Models (LLMs) such as GPT-4, we are talking about books of all kinds, content written on websites, large repositories of text and information in the world such as Wikipedia, but also less formal linguistic productions such as those we write on social networks, in public reviews of products or services, or even in emails. This variety allows these language models to process and handle text in different languages, registers and styles.
For people working in Natural Language Processing (NLP), data science and data engineering, there are great enablers such as Kaggle and repositories such as Awesome Public Datasets on GitHub, which provide direct access to downloadable public datasets. Some of these data files have been prepared for processing and are ready for analysis, while others are in an unstructured state and require prior cleaning and sorting before they can be worked with. Although they also contain quantitative numerical data, many of these sources offer textual data that can be used to train language models.
The problem of legitimacy
One of the complications encountered in creating these models is that the text data published on the internet and collected via APIs (direct connections that allow mass downloading from a website or repository) or other techniques is not always in the public domain. In many cases it is copyrighted: writers, translators, journalists, content creators, scriptwriters, illustrators, designers and also musicians claim licensing fees from the big tech companies for the use of their text and image content to train models. The media, in particular, are actors greatly affected by this situation, although their positions vary according to their circumstances and business decisions. There is therefore a need for open corpora that can be used for these training tasks without prejudice to intellectual property.
Characteristics suitable for a training corpus
Most of the characteristics that have traditionally defined a good corpus in linguistic research have not changed now that these text datasets are used to train language models.
- It is still beneficial to use whole texts rather than fragments to ensure coherence.
- Texts must be authentic, from linguistic reality and natural language situations, retrievable and verifiable.
- It is important to ensure a wide diversity in the provenance of texts in terms of sectors of society, publications, local varieties of languages and issuers or speakers.
- In addition to general language, a wide variety of specialised language, technical terms and texts specific to different areas of knowledge should be included.
- Register is fundamental in a language, so we must cover both formal and informal register, in its extremes and intermediate regions.
- Language must be well-formed to avoid interference in learning, so it is desirable to remove code marks, numbers or symbols that correspond to digital metadata and not to the natural formation of the language.
As for specific recommendations on the formats of the files that are to form part of these corpora, annotated text corpora should be stored in UTF-8 encoding and in JSON or CSV format, not in PDF. The preferred format for sound corpora is 16-bit WAV at 16 kHz (for voice) or 44.1 kHz (for music and audio). Video corpora should be compiled in MPEG-4 (MP4) format, and translation memories in TMX or CSV.
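As an illustration of these recommendations, here is a minimal sketch, assuming invented field names and file name, that lightly cleans two scraped text fragments (removing markup that belongs to digital metadata rather than to natural language) and stores them as a small annotated corpus in UTF-8 JSON.

```python
# Illustrative only: light clean-up plus saving an annotated corpus as UTF-8 JSON.
import json
import re

raw_texts = [
    "<p>El plazo de solicitud termina el 30 de junio.</p>",
    "¡Qué buena película vimos ayer! &nbsp;",
]

def clean(text: str) -> str:
    """Strip HTML-like markup left over from scraping (digital metadata)."""
    text = re.sub(r"<[^>]+>", "", text)   # remove tags
    text = text.replace("&nbsp;", " ")    # remove an HTML entity
    return text.strip()

corpus = [
    {"id": i, "text": clean(t), "register": reg}
    for i, (t, reg) in enumerate(zip(raw_texts, ["formal", "informal"]), start=1)
]

with open("corpus_sample.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=2)  # keep accents and ñ as real characters
```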
The text as a collective heritage
National libraries in Europe are actively digitising their rich repositories of history and culture, ensuring public access and preservation. Institutions such as the National Library of France or the British Library are leading the way with initiatives that digitise everything from ancient manuscripts to current web publications. This digital collection effort not only protects heritage from physical deterioration, but also democratises access for researchers and the public and, for some years now, has also allowed the compilation of training corpora for artificial intelligence models.
The corpora officially provided by national libraries allow text collections to be used to create public technology available to all: a collective cultural heritage that generates a new collective heritage, this time a technological one. The gain is greatest when these institutional corpora focus on complying with intellectual property laws, providing only open data and texts free of copyright restrictions, with expired or duly licensed rights. This, coupled with the encouraging fact that the amount of real data needed to train language models appears to be decreasing as technology advances, for example with the generation of synthetic data or the optimisation of certain parameters, indicates that it is possible to train large text models without infringing the intellectual property laws in force in Europe.
In particular, the Biblioteca Nacional de España (BNE) is making a major digitisation effort to make its valuable text repositories available for research, and in particular for language technologies. Since the first major mass digitisation of its physical collections in 2008, the BNE has opened up access to millions of documents with the sole aim of sharing and universalising knowledge. In 2023, thanks to investment from the European Union's Recovery, Transformation and Resilience funds, the BNE is promoting a new digital preservation project in its Strategic Plan 2023-2025. The plan focuses on four axes:
- the massive and systematic digitisation of collections,
- BNELab as a catalyst for innovation and data reuse in digital ecosystems,
- partnerships and new cooperation environments,
- and technological integration and sustainability.
The alignment of these four axes with new artificial intelligence and natural language processing technologies is more than obvious, as one of the main data reuses is the training of large language models. Both the digitised bibliographic records and the Library's cataloguing indexes are valuable materials for knowledge technology.
Spanish language models
In 2020, as a pioneering and relatively early initiative, Spain introduced MarIA, a language model promoted by the Secretary of State for Digitalisation and Artificial Intelligence and developed by the Barcelona Supercomputing Center (BSC-CNS), based on the archives of the National Library of Spain. In this case, the corpus was composed of texts from web pages that had been collected by the BNE since 2009 and that served to feed a model originally based on GPT-2.
A lot has happened between the creation of MarIA and the announcement at the 2024 Mobile World Congress of the construction of a large foundational language model, trained specifically in Spanish and the co-official languages. This system will be open source and transparent, and will only use royalty-free content in its training. The project is a pioneer at European level, as it seeks to provide an open, public and accessible language infrastructure for companies. Like MarIA, the model will be developed at the BSC-CNS, working together with the Biblioteca Nacional de España and other actors such as the Academia Española de la Lengua and the Asociación de Academias de la Lengua Española.
In addition to the institutions that can provide linguistic or bibliographic collections, there are many more institutions in Spain that can provide quality corpora that can also be used for training models in Spanish. The Study on reusable data as a language resource, published in 2019 within the framework of the Language Technologies Plan, already pointed to different sources: the patents and technical reports of the Spanish and European Patent and Trademark Office, the terminology dictionaries of the Terminology Centre, or data as elementary as the census of the National Statistics Institute, or the place names of the National Geographic Institute. When it comes to audiovisual content, which can be transcribed for reuse, we have the video archive of RTVE A la carta, the Audiovisual Archive of the Congress of Deputies or the archives of the different regional television stations. The Boletín Oficial del Estado itself and its associated materials are an important source of textual information containing extensive knowledge about our society and its functioning. Finally, in specific areas such as health or justice, we have the publications of the Spanish Agency of Medicines and Health Products, the jurisprudence texts of the CENDOJ or the recordings of court hearings of the General Council of the Judiciary.
European initiatives
In Europe there does not yet seem to be a precedent as clear as MarIA or the upcoming GPT-based model in Spanish, that is, state-driven projects trained with heritage data from national libraries or official bodies.
However, there is good previous work in Europe on making documentation available that could now be used to train European foundational AI systems. A good example is the Europeana project, which seeks to digitise and make accessible the cultural and artistic heritage of Europe as a whole. It is a collaborative initiative that brings together contributions from thousands of museums, libraries, archives and galleries, providing free access to millions of works of art, photographs, books, pieces of music and videos. Europeana holds almost 25 million text documents, which could be the basis for creating competent monolingual or multilingual foundational models in the different European languages.
There are also non-governmental initiatives with a global impact, such as Common Corpus, which are the ultimate proof that it is possible to train language models with open data and without infringing copyright laws. Common Corpus was released in March 2024 and is the largest dataset created for training large language models, with 500 billion words from various cultural heritage initiatives. The corpus is multilingual and is the largest to date in English, French, Dutch, Spanish, German and Italian.
And finally, beyond text, there are initiatives in other formats, such as audio, that can also be used to train AI models. In 2022, the National Library of Sweden released a sound corpus of more than two million hours of recordings from local public radio, podcasts and audiobooks. The aim of the project was to generate an AI-based, linguistically competent speech-to-text transcription model that maximises the number of speakers in order to achieve a diverse and democratic dataset available to all.
Until now, the sense of collectivity and heritage has been sufficient in collecting and making data in text form available to society. With language models, this openness achieves a greater benefit: that of creating and maintaining technology that brings value to people and businesses, fed and enhanced by our own linguistic productions.
Content prepared by Carmen Torrijos, expert in AI applied to language and communication. The contents and points of view reflected in this publication are the sole responsibility of the author.
The era of digitalisation in which we find ourselves has filled our daily lives with data products or data-driven products. In this post we discover what they are and show you one of the key data technologies to design and build this kind of products: GraphQL.
Introduction
Let's start at the beginning, what is a data product? A data product is a digital container (a piece of software) that includes data, metadata and certain functional logics (what and how I handle the data). The aim of such products is to facilitate users' interaction with a set of data. Some examples are:
- Sales scorecard: Online businesses have tools to track their sales performance, with graphs showing trends and rankings, to assist in decision making.
- Apps for recommendations: Streaming TV services have functionalities that show content recommendations based on the user's historical tastes.
- Mobility apps: The mobile apps of new mobility services (such as Cabify, Uber, Bolt, etc.) combine user and driver data and metadata with predictive algorithms, such as dynamic fare calculation or optimal driver assignment, in order to offer a unique user experience.
- Health apps: These applications make massive use of data captured by technological gadgets (such as the device itself, smart watches, etc.) that can be integrated with other external data such as clinical records and diagnostic tests.
- Environmental monitoring: There are apps that capture and combine data from weather forecasting services, air quality systems, real-time traffic information, etc. to issue personalised recommendations to users (e.g. the best time to schedule a training session, enjoy the outdoors or travel by car).
As we can see, data products accompany us on a daily basis, without many users even realising it. But how do you capture this vast amount of heterogeneous information from different technological systems and combine it to provide interfaces and interaction paths to the end user? This is where GraphQL positions itself as a key technology to accelerate the creation of data products, while greatly improving their flexibility and adaptability to new functionalities desired by users.
What is GraphQL?
GraphQL saw the light of day at Facebook in 2012 and was released as open source in 2015. It can be defined as a language and an interpreter of that language, so that a developer of data products can devise a way to describe their product based on a model (a data structure) that makes use of the data available through APIs.
Before the advent of GraphQL, we had (and still have) REST technology, which uses the HTTP protocol to ask questions and get answers based on the data. In 2021, we published a post where we presented this technology and walked through a small demonstration of how it works. In it, we explained the REST API as the standard technology that supports access to data by computer programs. We also highlighted how REST is a technology fundamentally designed to integrate services (such as an authentication or login service).
In a simple way, we can use the following analogy. It is as if REST is the mechanism that gives us access to a complete dictionary. That is, if we need to look up any word, we have a method of accessing the dictionary, which is alphabetical search. It is a general mechanism for finding any available word in the dictionary. However, GraphQL allows us, beforehand, to create a dictionary model for our use case (known as a "data model"). So, for example, if our final application is a recipe book, what we do is select a subset of words from the dictionary that are related to recipes.
To use GraphQL, data must always be available via an API. GraphQL provides a complete and understandable description of the API data, giving clients (human or application) the possibility to request exactly what they need. As quoted in this post, GraphQL is like an API to which we add a SQL-style "Where" statement.
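To make this idea tangible, here is a minimal sketch of how a client could ask a GraphQL endpoint for exactly the fields it needs, following the recipe-book analogy above. It is illustrative only: the endpoint URL and the field names are assumptions, not a real API.

```python
# Illustrative only: a GraphQL request over HTTP using the "requests" library.
import requests

# Following the dictionary/recipe analogy: we ask only for the "words" we need.
query = """
{
  recipes {
    title
    cookingTimeMinutes
  }
}
"""

# Hypothetical endpoint; a real data product would publish its own GraphQL URL.
response = requests.post("https://example.org/graphql", json={"query": query})

# Only title and cookingTimeMinutes come back for each recipe, nothing else.
print(response.json())
```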
Below, we take a closer look at GraphQL's strengths when the focus is on the development of data products.
Benefits of using GraphQL in data products:
- With GraphQL, the amount of data and queries on the APIs is considerably optimised. APIs for accessing certain data are not designed for a specific product (or use case) but as a general access specification (see the dictionary example above). This means that, on many occasions, in order to access a subset of the data available in an API, we have to perform several chained queries, discarding most of the information along the way. GraphQL optimises this process, as it defines a predefined (but adaptable) consumption model over a technical API. Reducing the amount of data requested has a positive impact on the rationalisation of computing resources, such as bandwidth or caches, and improves the response speed of systems.
- This has an immediate effect on the standardisation of data access. The model defined thanks to GraphQL creates a data consumption standard for a family of use cases. Again, in the context of a social network, if what we want is to identify connections between people, we are not interested in a general mechanism for accessing all the people in the network, but in a mechanism that lets us request only the people with whom a given user has some kind of connection. This kind of data access filter can be pre-configured thanks to GraphQL (a sketch of such a query appears after this list).
- Improved security and performance: By precisely defining queries and limiting access to sensitive data, GraphQL can contribute to a more secure and better-performing application.
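As a complement to the social-network example in the list above, the following sketch shows what such a pre-configured filter could look like as a GraphQL query with variables. The type and field names are assumptions made for illustration; the point is that the consumer can only request the connections of a given user, never the whole network.

```python
# Illustrative only: a pre-configured, filtered GraphQL query (assumed schema).
connections_query = """
query ConnectionsOf($userId: ID!) {
  user(id: $userId) {
    connections {
      name
      city
    }
  }
}
"""

variables = {"userId": "42"}  # hypothetical user identifier

# The query and its variables would be sent exactly as in the previous snippet:
# requests.post(url, json={"query": connections_query, "variables": variables})
```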
Thanks to these advantages, the use of this language represents a significant evolution in the way of interacting with data in web and mobile applications, offering clear advantages over more traditional approaches such as REST.
Generative Artificial Intelligence. A new superhero in town.
If the use of GraphQL language to access data in a much more efficient and standard way is a significant evolution for data products, what will happen if we can interact with our product in natural language? This is now possible thanks to the explosive evolution in the last 24 months of LLMs (Large Language Models) and generative AI.
The following image shows the conceptual scheme of a data product integrated with LLMs: a digital container that includes data, metadata and logical functions that are expressed as functionalities for the user, together with the latest technologies to expose information in a flexible way, such as GraphQL and conversational interfaces built on top of Large Language Models (LLMs).
How can data products benefit from the combination of GraphQL and the use of LLMs?
- Improved user experience. By integrating LLMs, people can ask questions of data products using natural language. This represents a significant change in how we interact with data, making the process more accessible and less technical. In practical terms, we will replace clicks with phrases when ordering a taxi.
- Security improvements along the interaction chain in the use of a data product. For this interaction to be possible, a mechanism is needed that effectively connects the backend (where the data resides) with the frontend (where the questions are asked). GraphQL is the ideal candidate due to its flexibility and ability to adapt to the changing needs of users, offering a direct and secure link between the data and the questions asked in natural language. That is, GraphQL can pre-select the data to be exposed in a query, thus preventing a general query from making private or unnecessary data visible for a particular application.
- Empowering queries with artificial intelligence: Artificial intelligence does not only play a role in natural-language interaction with the user. One can think of scenarios where the very model that is defined with GraphQL is assisted by artificial intelligence itself. This would enrich interactions with data products, allowing a deeper understanding and richer exploration of the available information. For example, we can ask a generative AI (such as ChatGPT) to take catalogue data that is exposed as an API and create a GraphQL model and endpoint for us (see the sketch after this list).
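As a sketch of this last idea, and only as an illustration, the snippet below builds the kind of prompt that could be sent to a generative model to propose a GraphQL model from the description of a catalogue API. Both the API description and call_llm() are hypothetical placeholders; no specific provider or endpoint is implied.

```python
# Illustrative only: asking a generative model to propose a GraphQL schema
# from the description of a catalogue API (both invented for the example).
api_description = """
Catalogue API (REST): GET /datasets returns id, title, publisher, theme,
format, licence and last_modified for each dataset.
"""

prompt = (
    "Given the following REST API description, propose a GraphQL type definition "
    "and one example query that a data product could use to list datasets by theme:\n"
    + api_description
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: a real implementation would send the prompt
    to whichever generative AI service the data product uses."""
    return "<schema and example query proposed by the model>"

print(call_llm(prompt))
```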
In short, the combination of GraphQL and LLMs represents a real evolution in the way we access data. GraphQL's integration with LLMs points to a future where access to data can be both accurate and intuitive, marking a move towards more integrated information systems that are accessible to all and highly reconfigurable for different use cases. This approach opens the door to a more human and natural interaction with information technologies, aligning artificial intelligence with our everyday experiences of communicating using data products in our day-to-day lives.
Content prepared by Alejandro Alija, expert in Digital Transformation and Innovation.
The contents and points of view reflected in this publication are the sole responsibility of its author.
Standardisation is essential to improve efficiency and interoperability in governance and data management. The adoption of standards provides a common framework for organising, exchanging and interpreting data, facilitating collaboration and ensuring data consistency and quality. The ISO standards, developed at international level, and the UNE norms, developed specifically for the Spanish market, are widely recognised in this field. Both catalogues of good practices, while sharing similar objectives, differ in their geographical scope and development approach, allowing organisations to select the most appropriate standards for their specific needs and context.
With the publication, a few months ago, of the UNE 0077, 0078, 0079, 0080, and 0081 specifications on data governance, management, quality, maturity, and quality assessment, users may have questions about how these relate to the ISO standards they already have in place in their organisation. This post aims to help alleviate these doubts. To this end, an overview of the main ICT-related standards is presented, with a focus on two of them: ISO 20000 on service management and ISO 27000 on information security and privacy, and the relationship between these and the UNE specifications is established.
Most common ISO standards related to data
ISO standards have the great advantage of being open, dynamic and agnostic to the underlying technologies. They are also responsible for bringing together the best practices agreed and decided upon by different groups of professionals and researchers in each of the fields of action. If we focus on ICT-related standards, there is already a framework of standards on governance, management and quality of information systems where, among others, the following stand out:
At the governance level:
- ISO 38500 for corporate governance of information technology.
At management level:
- ISO 8000 for data management systems and master data.
- ISO 20000 for service management.
- ISO 25000 for the quality of the generated product (both software and data).
- ISO 27000 and ISO 27701 for information security and privacy management.
- ISO 33000 for process evaluation.
In addition to these standards, there are others that are also commonly used in companies, such as:
- ISO 9000-based quality management system
- Environmental management system proposed in ISO 14000
These standards have been used for ICT governance and management for many years and have the great advantage that, as they are based on the same principles, they can be used perfectly well together. For example, it is very useful to mutually reinforce the security of information systems based on the ISO/IEC 27000 family of standards with the management of services based on the ISO/IEC 20000 family of standards.
The relationship between ISO standards and UNE data specifications
The UNE 0077, 0078, 0079, 0080 and 0081 specifications complement the existing ISO standards on data governance, management and quality by providing specific and detailed guidelines that focus on the particular aspects of the Spanish environment and the needs of the national market.
When the UNE 0077, 0078, 0079, 0080, and 0081 specifications were developed, they were based on the main ISO standards, so that they can be easily integrated into the management systems already in place in organisations (mentioned above), as can be seen in the following figure:
Figure 1. Relation of the UNE specifications with the different ISO standards for ICT.
Example of application of standard UNE 0078
The following is an example of how the UNE specifications and the ISO standards that many organisations have already had in place for years can be integrated more clearly, taking UNE 0078 as a reference. Although all the UNE data specifications are intertwined with most ISO standards on IT governance, management and quality, the UNE 0078 data management specification is more closely related to information security management systems (ISO 27000) and IT service management (ISO 20000). In Table 1 you can see the relationship of each process with each ISO standard.
| Process UNE 0078: Data Management | Related to ISO 20000 | Related to ISO 27000 |
|---|---|---|
| (ProcDat) Data processing | | |
| (InfrTec) Technology infrastructure management | X | X |
| (ReqDat) Data requirements management | X | X |
| (ConfDat) Data configuration management | | |
| (DatHist) Historical data management | X | |
| (SegDat) Data security management | X | X |
| (MetDat) Metadata management | X | |
| (ArqDat) Data architecture and design management | X | |
| (CIIDat) Data sharing, brokering and integration | X | |
| (MDM) Master data management | | |
| (HR) Human resources management | | |
| (CVidDat) Data lifecycle management | X | |
| (AnaDat) Data analysis | | |

Table 1. Relationship of UNE 0078 processes with ISO 27000 and ISO 20000.
Relationship of the UNE 0078 standard with ISO 20000
Regarding the interrelation between ISO 20000-1 and the UNE 0078 specification, here you can find a use case in which an organisation wants to make relevant data available for consumption throughout the organisation through different services. The integrated implementation of UNE 0078 and ISO 20000-1 enables organisations to:
- Ensure that business-critical data is properly managed and protected.
- Improve the efficiency and effectiveness of IT services, ensuring that the technology infrastructure supports the needs of the business and end users.
- Align data management and IT service management with the organisation's strategic objectives, improving decision making and market competitiveness.
The relationship between the two is manifested in how the technology infrastructure managed according to UNE 0078 supports the delivery and management of IT services according to ISO 20000-1.
This requires at least the following:
- Firstly, in the case of making data available as a service, a well-managed and secure IT infrastructure is necessary. This is essential, on the one hand, for the effective implementation of IT service management processes, such as incident and problem management, and, on the other hand, to ensure business continuity and the availability of IT services.
- Secondly, once the infrastructure is in place, and knowing that the data will be made available for consumption at some point, the principles of sharing and brokering that data need to be managed. For this purpose, the UNE 0078 specification includes the data sharing, intermediation and integration process. Its main objective is to enable the acquisition and/or delivery of data for consumption or sharing, providing where necessary for the deployment of intermediation mechanisms, as well as for its integration. This UNE 0078 process is related to several of the processes in ISO 20000-1, such as business relationship management, service level management, demand management and capacity management of the data being made available.
Relationship of the UNE 0078 standard with ISO 27000
Likewise, the technological infrastructure created and managed for a specific objective must ensure minimum data security and privacy standards; therefore, the implementation of the good practices included in ISO 27000 and ISO 27701 will be necessary to manage the infrastructure from the perspective of information security and privacy, which is a clear example of the interrelation between the three management systems: services, information security and privacy, and data.
Not only is it essential that the data is made available to organisations and citizens in an optimal way; it is also necessary to pay special attention to the security of the data throughout its entire lifecycle while it is being made available. This is where the ISO 27000 family brings its full value. ISO 27000, and in particular ISO 27001, fulfils the following objectives:
- It specifies the requirements for an information security management system (ISMS).
- It focuses on the protection of information against unauthorised access, data integrity and confidentiality.
- It helps organisations to identify, assess and manage information security risks.
In this line, its interrelation with the UNE 0078 data management specification is established through the data security management process. Through the application of the different security mechanisms, it is verified that the information handled in the systems is not subject to unauthorised access, maintaining its integrity and confidentiality throughout the data's life cycle. Similarly, a triad can be built between this relationship, the data security management process of the UNE 0078 specification, and the UNE 20000-1 process of service management system (SGSTI) operation: information security management.
The following figure presents how the UNE 0078 specification complements the current ISO 20000 and ISO 27000 as applied to the example discussed above.
Figure 2. Relation of UNE 0078 processes with ISO 20000 and ISO 27000 applied to the case of data sharing.
Through the above cases, it can be seen that the great advantage of the UNE 0078 specification is that it integrates seamlessly with existing security management and service management systems in organisations. The same applies to the rest of the UNE standards 0077, 0079, 0080, and 0081. Therefore, if an organisation that already has ISO 20000 or ISO 27000 in place wants to implement data governance, management and quality initiatives, alignment between the different management systems with the UNE specifications is recommended, as they are mutually reinforcing from a security, service and data point of view.
Content prepared by Dr. Fernando Gualo, Professor at UCLM and Data Governance and Quality Consultant. The contents and points of view reflected in this publication are the sole responsibility of its author.
The alignment of artificial intelligence is a term in use since the 1960s, referring to orienting the goals of intelligent systems in the exact direction of human values. The advent of generative models has brought this concept of alignment back into fashion, and it becomes more urgent the more intelligence and autonomy systems show. However, no alignment is possible without a prior, consensual and precise definition of these values. The challenge today is to find enriching objectives where the application of AI has a positive and transformative effect on knowledge, social organisation and coexistence.
The right to understand
In this context, language processing, one of the main pillars of today's AI, has been making valuable contributions to clear communication and, in particular, to clear language for years. Let us look at what these concepts are:
- Clear communication, as a discipline, aims to make information accessible and understandable for all people, using writing resources, but also visual, design, infographics, user experience and accessibility.
- Clear language focuses on the composition of texts, with techniques for presenting ideas directly and concisely, without stylistic excesses or omissions of key information.
Both concepts are closely linked to people's right to understand.
Before ChatGPT: analytical approaches
Before the advent of generative AI and the popularisation of GPT capabilities, artificial intelligence was applied to plain language from an analytical point of view, with different classification and pattern search techniques. The main need then was for a system that could assess whether or not a text was understandable, but there was not yet the expectation that the same system could rewrite our text in a clearer way. Let's look at a couple of examples:
This is the case of Clara, an analytical AI system openly available in beta. Clara is a mixed system: on the one hand, it has learned which patterns characterise clear and unclear texts from a corpus of text pairs prepared by specialists in clear communication. On the other hand, it handles nine metrics designed by computational linguists to decide whether or not a text meets the minimum requirements for clarity, for example the average number of words per sentence, the technical terms used or the frequency of connectors. Finally, Clara returns a percentage score to indicate how close the written text is to being clear text. This allows the user to correct the text according to Clara's indications and submit it for re-evaluation.
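To give an idea of what metrics of this kind look like, here is a minimal sketch, assuming an invented list of connectors, that computes two toy indicators of the type mentioned above: average words per sentence and frequency of connectors. It is not Clara's actual implementation, only an illustration.

```python
# Illustrative only: two toy clarity metrics (not Clara's real metrics).
import re

CONNECTORS = {"however", "therefore", "moreover", "consequently", "furthermore"}

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation and digits."""
    return re.findall(r"[a-zñáéíóú]+", text.lower())

def avg_words_per_sentence(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

def connector_frequency(text: str) -> float:
    words = tokenize(text)
    return sum(1 for w in words if w in CONNECTORS) / len(words)

sample = "The procedure is simple. However, you must submit the form before 30 June."
print(avg_words_per_sentence(sample), connector_frequency(sample))
```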
Other analytical systems have taken a different approach, such as Artext. Artext is more like a traditional text editor, where we can write our text and activate a series of revisions, such as participles, verb nominalisations or the use of negation. Artext highlights in colour the words or expressions in our text and advises us, in a side menu, on what to take into account when using them. The user can rewrite the text until, in the different revisions, the words and expressions marked in colour disappear.
Both Clara and Artext specialise in administrative and financial texts, with the aim of being of use mainly to public administration, financial institutions and other sources of difficult-to-understand texts that have an impact on citizens.
The generative revolution
Analytical AI tools are useful and very valuable when we want to evaluate a text over which we need to retain more control. However, since the arrival of ChatGPT in November 2022, users' expectations have risen even further. We no longer need only an evaluator; we expect a translator, an automatic transformer of our text into a clearer version. We insert the original version of the text into the chat and, through a direct instruction called a prompt, we ask it to transform the text into something clearer and simpler, understandable by anyone.
If we need more clarity, we only have to repeat the instruction and the text becomes simpler again before our eyes.
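As an illustration, the instruction could look like the sketch below; both the wording of the prompt and the sample administrative sentence are assumptions, not a prescribed formula.

```python
# Illustrative only: a clarification prompt of the kind described above.
original_text = (
    "El abono de la tasa deberá efectuarse con carácter previo a la presentación "
    "de la solicitud, sin perjuicio de lo dispuesto en la normativa aplicable."
)

prompt = (
    "Rewrite the following text in plain language, so that any citizen can "
    "understand it, without omitting key information:\n\n" + original_text
)

# The prompt would be pasted into a conversational model such as ChatGPT;
# repeating the instruction on the result simplifies the text further.
print(prompt)
```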
By using generative AI we reduce cognitive effort, but we also lose much of the control over the text. Most importantly, we will not know what modifications are being made and why, and we may lose or alter information. If we want to increase control and keep track of the changes, deletions and additions that ChatGPT makes to the text, we can use a plug-in such as EditGPT, available as an extension for Google Chrome, which lets us track changes in our interactions with the chat in a similar way to Word. However, we would still not understand the rationale for the changes made, as we would with tools such as Clara or Artext designed by language professionals. One option, albeit a limited one, is to ask the chat to justify each of these changes, but the interaction would become cumbersome, complex and inefficient, not to mention the excessive enthusiasm with which the model would try to justify its corrections.
Examples of generative clarification
Beyond the speed of transformation, generative AI has other advantages over analytical AI, such as identifying elements that can only be detected with GPT-like capabilities. For example, detecting whether an abbreviation or acronym has been expanded earlier in the text, or whether a technical term is explained immediately after it appears. This requires very complex semantic analysis for analytical AI or rule-based models. In contrast, a large language model is able to establish an intelligent relationship between the acronym and its expansion, or between the technical term and its meaning, to recognise whether this explanation exists somewhere in the text, and to add it where relevant.
Open data to inform clarification
Universal access to open data, especially when it is ready for computational processing, makes it indispensable for training large language models. Huge sources of unstructured information such as Wikipedia, the Common Crawl project or Gutenberg allow systems to learn how language works. And, on this generalist basis, it is possible to fine-tune models with specialised datasets to make them more accurate in the task of clarifying text.
In the application of generative artificial intelligence to plain language we have the perfect example of a valuable purpose, useful to citizens and positive for social development. Beyond the fascination it has aroused, we have the opportunity to use its potential in a use case that favours equality and inclusiveness. The technology exists, we just need to go down the difficult road of integration.
Content prepared by Carmen Torrijos, expert in AI applied to language and communication.
The contents and points of view reflected in this publication are the sole responsibility of the author.