Publication date 11/03/2026
Person using a laptop with an artificial intelligence-based interface on the screen.
Description

"I'm going to upload a CSV file for you. I want you to analyze it and summarize the most relevant conclusions you can draw from the data." A few years ago, data analysis was the territory of those who knew how to write code and use complex technical environments, and such a request would have required programming or advanced Excel skills. Today, being able to analyze data files quickly with AI tools gives us great professional autonomy. Asking questions, testing preliminary ideas and exploring information first-hand changes our relationship with knowledge, especially because we stop depending on intermediaries to obtain answers. Gaining the ability to analyze data with AI independently speeds up processes, but it can also make us overconfident in the conclusions.
Based on the example of a raw data file, we are going to review possibilities, precautions and basic guidelines to explore the information without assuming conclusions too quickly.

The file:

To show an example of data analysis with AI we will use a file from the National Institute of Statistics (INE) that collects information on tourist flows in Spain, specifically on occupancy in rural tourism accommodation. The data file covers January 2001 to December 2025. It is broken down by travelers and overnight stays, by place of residence and by autonomous community or city, which allows comparative analyses to be carried out over time. At the time of writing, the last update to this dataset was on January 28, 2026.

Screenshot of the National Statistics Institute (INE) page corresponding to a dataset. The title of the dataset appears at the top: “Survey of occupancy in rural tourism accommodation.” Below this is a line with an icon and the option to access via API next to the name of the dataset: “Travelers and overnight stays by autonomous communities and cities.” Above this line, there is a pop-up box (tooltip) with the following text: "Last modified: 01/28/2026 Dimensions: Autonomous Communities and Cities, Travelers and overnight stays, Residence Period: Jan. 2001 - Dec. 2025." To the left of the rows are several circular icons and “API” labels indicating options for accessing the data. In the background, part of the text on the page related to the survey and its dimensions can be seen, although it is partially covered by the pop-up box.

Figure 1. Dataset information. Source: National Institute of Statistics (INE).

1. Initial exploration

For this first exploration we are going to use the free version of Claude, the AI chat assistant developed by Anthropic. Its models rank among the most advanced in reasoning and analysis benchmarks, which makes it well suited to this exercise, and it is widely used by the community for tasks that require code.

Let's imagine that we are seeing the data file for the first time. We know in broad strokes what it contains, but we do not know how the information is structured. Our first prompt, therefore, should focus on describing it:

PROMPT: I want to work with a data file on occupancy in rural tourism accommodation. Explain to me what structure the file has: what variables it contains, what each one measures and what possible relationships exist between them. Also point out possible missing values or elements that require clarification.

Screenshot of Claude's initial exploration of the data file. It includes the section “General description,” which indicates that the file contains 36,000 rows (plus the header) with monthly data on travelers and overnight stays in rural tourism accommodations in Spain, presumably from the INE (National Institute of Statistics). It covers the period from January 2001 to December 2025, with a breakdown by territory and by origin of visitors. Below is the “Variables (columns)” section, which explains several columns in the dataset: Territorial totals, Autonomous Communities and Cities, Travelers and overnight stays, Residence: Level 1 and Residence: Level 2, with descriptions of what each variable measures and the possible values (e.g., travelers, overnight stays, residents in Spain, or residents abroad). Details of each section are provided below. Territorial Totals — This column has only one value in the entire file: “National Total.” This column identifies the maximum level of aggregation, but in practice it does not provide any variation. It probably exists because the original source allows for other levels (provinces, municipalities) that are not included here. Autonomous Communities and Cities — Identifies the specific territory: the 17 autonomous communities plus Ceuta and Melilla (19 territories in total). When this column is empty (NaN), the row corresponds to the aggregate for Spain as a whole, i.e., the “National Total” without a territorial breakdown. There are 1,800 rows in this situation.  Travelers and overnight stays — Indicates what magnitude is being measured. It only has two values: Travelers (number of people staying) and Overnight stays (number of nights those travelers spend). Each row contains one of the two metrics, so both coexist in separate rows for the same territories and periods.  Residence: Level 1 — Only contains the value “Total” in all rows. 
This is a higher level of aggregation, indicating that the data is not broken down by type of residence at this level. It does not provide useful variation as it appears in the file. Residence: Level 2 — Breaks down the origin of the traveler into three possibilities: the aggregate (NaN, equivalent to “all”), “Residents in Spain,” and “Residents Abroad.” There are 12,000 rows with NaN, which correspond to the total without distinction of origin.

Figure 2. Initial exploration of the data file with Claude. Source: Claude.

Once Claude has given us the general idea and an explanation of the variables, it is good practice to open the file and do a quick check. The objective is to verify that, at a minimum, the number of rows, the number of columns, the names of the variables, the time period and the type of data match what the model has told us.
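This quick check can also be scripted. The following is a minimal sketch using only Python's standard library. The miniature in-memory sample is hypothetical, the column names follow Claude's description in Figure 2, and `Period` is an assumed name, since the article does not show the header of the period column. With the real file, the same function would receive an open file handle instead.

```python
import csv
import io

# Hypothetical miniature of the file: tab-separated, as Claude reported.
# Column names follow Claude's description; "Period" is an assumed name.
SAMPLE = (
    "Territorial totals\tAutonomous Communities and Cities\t"
    "Travelers and overnight stays\tResidence: Level 1\t"
    "Residence: Level 2\tPeriod\tTotal\n"
    "National Total\t\tTravelers\tTotal\t\t2025M12\t100\n"
    "National Total\t01 Andalucía\tTravelers\tTotal\t\t2025M12\t40\n"
    "National Total\t01 Andalucía\tTravelers\tTotal\t\t2001M01\t10\n"
)

def describe(handle):
    """Return basic facts to compare against the model's description."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    periods = sorted({row["Period"] for row in rows})
    return {
        "n_rows": len(rows),
        "columns": reader.fieldnames,
        "first_period": periods[0],
        "last_period": periods[-1],
    }

summary = describe(io.StringIO(SAMPLE))
print(summary)

# Row count implied by the description in Figure 2:
# 300 months x 2 metrics x 3 residence levels x 20 territories (19 + national).
expected_rows = 300 * 2 * 3 * 20
print(expected_rows)  # 36000
```

If the figures printed for the real file differ from the ones the model reported (36,000 rows, January 2001 to December 2025), that is the signal to stop and investigate.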

If we detect any errors at this point, the LLM may not be reading the data correctly. If the error persists after trying again in another conversation, it is a sign that something in the file makes it difficult to read automatically. In this case, it is best not to continue with the analysis: the conclusions may look convincing, but they will be based on misread data.

2. Anomaly management

Second, if we have discovered anomalies, we should document them and decide how to handle them before proceeding with the analysis. We can ask the model to suggest what to do, but the final decisions will be ours. For example:

  • Missing values: if there are empty cells, we need to decide whether to fill them with an "average" value from the column or simply delete those rows.
  • Duplicates: we have to eliminate repeated rows or rows that do not provide new information.
  • Formatting errors or inconsistencies: we must correct these so that the variables are coherent and comparable. For example, dates represented in different formats.
  • Outliers: if a number appears that does not make sense or is exaggeratedly different from the rest, we have to decide whether to correct it, ignore it or treat it as it is.
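Before deciding how to handle each case, it can help to count the anomalies first. The following is a minimal sketch, assuming the rows have already been loaded as dictionaries. The sample rows, the field names and the helper `count_anomalies` are hypothetical, and the expected period format (such as `2025M11`) is an assumption. Outliers are left out because detecting them depends on the variable being analyzed.

```python
from collections import Counter

# Hypothetical rows; in practice they would come from the data file.
rows = [
    {"territory": "01 Andalucía", "period": "2025M11", "total": "120"},
    {"territory": "01 Andalucía", "period": "2025M11", "total": "120"},  # duplicate
    {"territory": "18 Ceuta", "period": "2025M11", "total": ""},         # missing
    {"territory": "02 Aragón", "period": "11/2025", "total": "95"},      # odd date format
]

def count_anomalies(rows):
    """Count missing totals, exact duplicate rows and unexpected period formats."""
    missing = sum(1 for r in rows if r["total"].strip() == "")
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    # Expected format assumed here: four-digit year, "M", two-digit month.
    bad_period = sum(
        1 for r in rows
        if not (len(r["period"]) == 7 and r["period"][4] == "M"
                and r["period"][:4].isdigit() and r["period"][5:].isdigit())
    )
    return {"missing": missing, "duplicates": duplicates, "bad_period": bad_period}

report = count_anomalies(rows)
print(report)
```

The counts only describe the problem. Whether to fill, delete or correct each case remains a decision for the analyst.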
     

Example of missing values analysis with Claude. Claude is asked "Show me an example of the missing values in Total for Ceuta and Melilla" and its answer is: Here you go. The null rows for Ceuta and Melilla have exactly the same structure: the Total column is empty in every period and for every combination of metric and origin, without exception. There are 1,800 null rows for each city (300 months × 2 metrics × 3 residence levels = 1,800), and that pattern is perfectly uniform: there is not a single month with recorded data for either city. This confirms that these are not occasional missing data, but a total and systematic absence. The INE simply does not record rural tourism in Ceuta and Melilla, and the rows exist in the file because the table structure generates them by construction when crossing all territories with all periods.

Figure 3. Example of missing values analysis with Claude. Source: Claude.

In the case of our file, for example, we have detected that the missing values in the Total variable for Ceuta and Melilla are structural: no rural tourism is recorded in these cities, so we could exclude them from the analysis.
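One way to tell structural gaps from occasional ones is to compute the share of missing values per territory. The following is a minimal sketch with hypothetical data. The helper `missing_share` is illustrative, and the territory labels follow the format quoted in Claude's answers.

```python
from collections import defaultdict

# Hypothetical (territory, total) pairs; an empty string marks a missing value.
observations = [
    ("01 Andalucía", "120"), ("01 Andalucía", "98"),
    ("18 Ceuta", ""), ("18 Ceuta", ""),
    ("19 Melilla", ""), ("19 Melilla", ""),
    ("02 Aragón", "95"), ("02 Aragón", ""),
]

def missing_share(observations):
    """Return the fraction of missing totals for each territory."""
    counts = defaultdict(lambda: [0, 0])  # territory -> [missing, all]
    for territory, total in observations:
        counts[territory][0] += total == ""
        counts[territory][1] += 1
    return {t: missing / n for t, (missing, n) in counts.items()}

shares = missing_share(observations)
# A share of exactly 1.0 means the territory never has a recorded value.
structural = sorted(t for t, share in shares.items() if share == 1.0)
print(structural)  # ['18 Ceuta', '19 Melilla']
```

In this invented sample, Aragón has one gap out of two observations, which would be an occasional missing value and calls for a different decision than a territory with no data at all.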

Before making the decision, a good practice at this point is to ask the LLM for the pros and cons of modifying the data. The answer can give us some clue as to which is the best option, or indicate some inconvenience that we had not taken into account.
 

Example of Claude's analysis of whether or not to remove values. Claude is asked, “Does it make sense to remove Ceuta and Melilla from the analysis?” and responds: It depends on the purpose of the analysis, but in most cases it does make sense to remove them, for several reasons. The most obvious is the one we already saw: they do not contribute any useful data. With 3,600 completely empty rows, their presence only adds noise and can complicate aggregate calculations or visualizations without providing any informational benefit. Furthermore, their exclusion is conceptually justified. Ceuta and Melilla are autonomous cities of an urban and port nature, without the type of rural environment that characterizes this type of accommodation. It is not that they lack data: it is that the phenomenon measured by the statistics simply does not exist there in any appreciable form. Including them would distort any comparison between territories, especially if you calculate averages, rankings, or map representations. That said, there are situations in which it might make sense to keep them. If the analysis is purely descriptive of the original file (e.g., a data quality report), removing them would hide relevant information about territorial coverage. And if at some point the INE began to record data for those cities, having them in the structure would facilitate future incorporation.

Figure 4. Claude's analysis of whether or not to remove the values. Source: Claude.

If we decide to go ahead and exclude the cities of Ceuta and Melilla from the analysis, Claude can help us make this modification directly on the file. The prompt would be as follows:

PROMPT: Remove all rows corresponding to Ceuta and Melilla from the file, so that the rest of the data remains intact. Also explain the steps you're following so that I can review them.
 

Screenshot of Claude's response to the previous prompt: "It executed two commands and read a file > These are the steps I followed: 1. Loading the original file without modifying it, maintaining the tab separator and UTF-8 encoding with BOM used by the INE file. 2. Filtered using the inverse condition: all rows whose value in Autonomous Communities and Cities is not ‘18 Ceuta’ or ‘19 Melilla’ are retained. The rest of the rows, including those in the National Total (where that column is empty), remain intact. 3. Verification before saving, checking three things: that the difference in rows is exactly 3,600 (as expected: 1,800 from Ceuta + 1,800 from Melilla), that the remaining 17 territories are correct, and that the null values in Total have disappeared completely, confirming that all the null values in the file came exclusively from those two cities. 4. Saved in a new file, maintaining the same format (TSV, UTF-8 with BOM) so that it is compatible with the same tools as the original. The result is a file with 32,400 rows, with no missing values in Total." Below the response is a downloadable block: "Travelers and overnight stays without Ceuta Melilla", Table · CSV, Download.

Figure 5. Step-by-step modification of the data in Claude. Source: Claude.

At this point, Claude offers the modified file for download, so it is good practice to validate manually that the operation was done correctly. For example, compare the number of rows in the two files, or check a few random rows against the original file to make sure that the data has not been corrupted.
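The same validation can be reproduced independently of the model. The following is a minimal sketch using the standard library, with a hypothetical four-row miniature of the file. The labels `18 Ceuta` and `19 Melilla` are the ones quoted in Claude's answer, while the `Period` column name is an assumption.

```python
import csv
import io

EXCLUDED = {"18 Ceuta", "19 Melilla"}  # labels as quoted in Claude's answer

# Hypothetical miniature of the original file (tab-separated).
ORIGINAL = (
    "Autonomous Communities and Cities\tPeriod\tTotal\n"
    "\t2025M12\t100\n"              # national aggregate: empty territory
    "01 Andalucía\t2025M12\t40\n"
    "18 Ceuta\t2025M12\t\n"
    "19 Melilla\t2025M12\t\n"
)

def read_rows(text):
    """Parse tab-separated text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

original_rows = read_rows(ORIGINAL)
# Inverse condition: keep everything that is not Ceuta or Melilla,
# including the national rows where the territory column is empty.
kept_rows = [
    r for r in original_rows
    if r["Autonomous Communities and Cities"] not in EXCLUDED
]

removed = len(original_rows) - len(kept_rows)
remaining_nulls = sum(1 for r in kept_rows if r["Total"] == "")
print(removed, remaining_nulls)  # 2 0
```

With the real file, the expected values would be the ones Claude reported: 3,600 rows removed and no missing values left in Total.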

3. First questions and visualizations

If the result so far is satisfactory, we can start exploring the data, asking initial questions and looking for interesting patterns. When starting the exploration, it is best to ask broad, clear questions that the data can answer easily, because they give us a first overview.

PROMPT: Work with the file without Ceuta and Melilla from now on. Which have been the five communities with the most rural tourism over the whole period?
 

Screenshot of a response generated by Claude. At the top is the indication: "Ran 2 commands >" Below it is the following text: "Taking as a reference the cumulative total of travelers over the whole period (January 2001 to December 2025) and using the aggregate without distinction of origin (residents in Spain + foreigners), the ranking is as follows:" Next comes a table with three columns: #, Community, Total travelers. Table rows: 1, Castilla y León, 14,745,928; 2, Cataluña, 8,568,320; 3, Andalucía, 5,682,985; 4, Castilla-La Mancha, 5,256,132; 5, Illes Balears, 4,953,850. Below the table is the following explanatory paragraph: "Castilla y León stands out very clearly, almost doubling Cataluña, in second place. Its leading position makes sense given that it is the community with the largest rural area in Spain and has had a consolidated supply of rural houses for decades. The presence of Illes Balears in the top 5 is striking, since it is usually associated more with sun and beach tourism, which suggests that it also has a relevant rural segment."

Figure 6. Claude's response to the five communities with the most rural tourism in the period. Source: Claude.
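A ranking like this is easy to get wrong by counting the same travelers twice, because the file mixes aggregate rows (national total, all origins) with their breakdowns. The following is a minimal sketch of the calculation on hypothetical records. The values are invented and the function `top_territories` is illustrative, not the code Claude ran.

```python
from collections import defaultdict

# Hypothetical monthly records: (territory, metric, residence, value).
# An empty territory is the national aggregate; an empty residence is the
# total without distinction of origin.
records = [
    ("07 Castilla y León", "Travelers", "", 300),
    ("07 Castilla y León", "Travelers", "Residents in Spain", 250),
    ("09 Cataluña", "Travelers", "", 200),
    ("09 Cataluña", "Overnight stays", "", 500),
    ("01 Andalucía", "Travelers", "", 150),
    ("", "Travelers", "", 650),
]

def top_territories(records, n=5):
    """Rank territories by cumulative travelers, avoiding double counting."""
    totals = defaultdict(int)
    for territory, metric, residence, value in records:
        # Keep only travelers, the all-origins aggregate, and real territories.
        if metric == "Travelers" and residence == "" and territory:
            totals[territory] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

ranking = top_territories(records)
print(ranking)
```

Checking which filters the model applied (metric, residence level, exclusion of the national total) is a useful way to review an answer such as the one in Figure 6.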

Finally, we can ask Claude to help us visualize the data. Instead of specifying a particular chart type, we give it the freedom to choose the format that best displays the information.

PROMPT: Can you visualize this information on a graph? Choose the most appropriate format to represent the data.

Screenshot of a split-screen interface showing Claude's response to the request to visualize data in a graph. Left panel (conversation / explanation): At the top is a label: "Rural tourism occupancy data structure". Below is the user's message: "Can you visualize this information on a graph? Choose the most appropriate format to represent the data." The assistant then responds: "I am creating a horizontal bar chart, which is the most suitable format for comparing magnitudes between nominal categories such as the autonomous communities. Horizontal bars make it possible to read the full names and make it easier to compare the distances between values visually, something that a pie or line chart would not do as well here. Each bar is scaled proportionally to the maximum (Castilla y León) and they are animated on loading so that the comparison is more intuitive." Below is a downloadable block with a file icon and the text: "Top5 rural tourism", Code · HTML. To the right of the block there is a button: "Download". At the bottom left of the panel there is a small loading icon (spinner).
Right panel (graph preview): At the top right there is a bar with the file name: "Top5 rural tourism · HTML". To the right there is a button: "Copy". Below is the visualization. Small text at the top: "SURVEY OF OCCUPANCY IN RURAL TOURISM ACCOMMODATION · INE". Large title: "Top 5 communities in rural tourism". Subtitle: "Cumulative total of travelers · 2001–2025". Below is a horizontal bar chart with five communities and their approximate values: Castilla y León, long dark red bar, 14.7M; Cataluña, green bar, 8.6M; Andalucía, brown bar, 5.7M; Castilla-La Mancha, light green bar, 5.3M; Illes Balears, light orange bar, 4.9M. At the bottom there is a small note: "Aggregate data on travelers in rural tourism accommodation (residents in Spain + foreigners). Source: Survey of Occupancy in Rural Tourism Accommodation, INE." In the lower right corner it reads: "Spain · 2001–2025".

Figure 7. Graph prepared by Claude to represent the information. Source: Claude.

Here the screen splits in two: on the left, we can continue with the conversation or download the file, while on the right we can view the graph directly. Claude has generated a very visual, ready-to-use horizontal bar chart. The colors differentiate the communities, and the date range and type of data are correctly indicated.

What happens if we ask it to change the color palette of the chart to an unsuitable one? In this case, for example, we are going to ask for a series of pastel shades that are barely distinguishable from one another.

PROMPT: Can you change the color palette of the chart to this? #E8D1C5, #EDDCD2, #FFF1E6, #F0EFEB, #EEDDD3
 

Screenshot of a split-screen interface showing Claude's response to the request to change the color palette of a graph.  Left panel (conversation)  At the top is a label:  “Rural tourism occupancy data structure.”  Below is the user's message in a speech bubble:  “Can you change the color palette of the graph to this one? #E8D1C5, #EDDCD2, #FFF1E6, #F0EFEB, #EEDDD3”  Below is the assistant's response:  “I will use the colors from the new palette in soft and light tones, which may make the bars difficult to distinguish from the background. To maintain readability, I will slightly darken the background and adjust the text color of the value labels so that they contrast well.  Done. I have applied the five colors you specified to the bars. Since the palette is very light and pale, I have also changed the background to a dark shade so that the bars stand out and the chart is readable, keeping the rest of the design intact.”  Below is a downloadable block with a file icon and the text:  “Top 5 rural tourism” Code · HTML.  To the right of the block there is a button:  “Download”  At the bottom left of the panel there are several status indicators and a small loading icon (spinner).  Right panel (graph preview) At the top there is a bar with the file name: “Top 5 rural tourism · HTML.” On the right are action buttons, including: “Copy.” Below is the visualization with a dark background.  Small text at the top of the graph:  “SURVEY OF OCCUPANCY IN RURAL TOURISM ACCOMMODATION · INE”  Main title:  “Top 5 communities in rural tourism”  Subtitle:  “Cumulative total of travelers · 2001–2025”  Below is a horizontal bar chart with five communities and values to the right of each bar. The bars use light shades from the requested palette.  
Graph elements:  Castile and León — long light-colored bar — 14.7 million  Catalonia — light bar — 8.6 million  Andalusia — light bar — 5.7 million  Castile-La Mancha — light bar — 5.3 million  Balearic Islands — light bar — 5.0M  At the bottom, there is a small explanatory text:  “Aggregate data on travelers in rural tourism accommodations (residents in Spain + foreigners). Source: Survey of Occupancy in Rural Tourism Accommodations, INE.”  In the lower right corner of the graph, it reads:  “Spain · 2001–2025.”

Figure 8. Adjustments made to the graph by Claude to represent the information. Source: Claude.

Faced with the challenge, Claude adjusts the chart on its own initiative: it darkens the background and changes the text color of the labels to maintain readability and contrast.
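We can check for ourselves why a dark background helps with this palette by computing contrast ratios. The following is a minimal sketch of the contrast ratio defined in the WCAG 2.x accessibility guidelines, applied to the five requested colors. The two background colors are assumptions, since the article does not give the exact shades Claude used, and this is not necessarily how Claude reached its decision.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB color, following the WCAG 2.x definition."""
    digits = hex_color.lstrip("#")
    channels = [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

PALETTE = ["#E8D1C5", "#EDDCD2", "#FFF1E6", "#F0EFEB", "#EEDDD3"]
LIGHT_BG = "#FFFFFF"  # assumed light background
DARK_BG = "#1E1E1E"   # assumed dark background, not taken from the article

for color in PALETTE:
    print(color,
          round(contrast_ratio(color, LIGHT_BG), 2),
          round(contrast_ratio(color, DARK_BG), 2))
```

WCAG recommends a ratio of at least 3:1 for graphical elements. Against white, every color in this palette falls below that threshold, while against the assumed dark background all of them clear it comfortably. The contrast between the bars themselves remains low on either background, which is why the colors no longer help to tell the communities apart.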

 

The whole exercise above was done with Claude Sonnet 4.6, which is not Anthropic's most capable model. Higher-tier models, such as Claude Opus 4.6, offer greater reasoning capacity, deeper understanding and more refined results. In addition, there are many other AI-based tools for working with data and visualizations, such as Julius or Quadratic. Although the possibilities they offer are almost endless, when we work with data it is still essential to maintain our own methodology and criteria.

Contextualizing the data we are analyzing in real life and connecting it with other knowledge is not a task that can be delegated; we need to have at least a basic prior idea of what we want to achieve with the analysis in order to convey it to the system. This will allow us to ask better questions, interpret the results properly and therefore prompt more effectively.

Content created by Carmen Torrijos, expert in AI applied to language and communication. The content and views expressed in this publication are the sole responsibility of the author.