Communicating your data analysis

Publication date: 18-11-2020


Introduction

In this post we introduce a topic that is important in the data analysis field and yet tends to go unnoticed by most non-specialist audiences. When we talk about advanced data analytics, we tend to think of sophisticated tools and advanced knowledge of machine learning and artificial intelligence. Without detracting from these highly sought-after skills, there are far more basic aspects of data analysis that have a much greater impact on the end user or consumer of the results. This time we talk about the communication of data analysis. Good communication of the results and process of a data analysis can make the difference between the success and failure of a data analytics project.

Communication: an integrated process

We might think that communicating a data analysis is a subsequent process, decoupled from the technical analysis itself; ultimately, something left for last. This is a very common mistake among analysts and data scientists. Communication must be integrated into the analysis process: from the more tactical perspective of documenting the code and the analysis process, to the communication to the final audience (in the form of presentations and/or reports). Everything must be one integrated process. In the same way that the DevOps philosophy has prevailed in the world of software development, the DataOps philosophy must prevail in the data analysis space. In both cases, the goal is the continuous and agile delivery of value in the form of software and data.

Gartner defines DataOps as "a collaborative data management practice focused on improving communication, integration and automation of data flows between data managers and consumers in an enterprise" (Innovation Insight for DataOps).
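To make the idea tangible, below is a minimal sketch of what such an integrated pipeline could look like in R, using the targets and tarchetypes packages (our choice for illustration here; the file name informe.Rmd and the target names are hypothetical). The key point is that the final report is just one more node of the pipeline, rebuilt automatically whenever the data or the analysis changes.

# _targets.R: hypothetical pipeline definition (a sketch, not this post's actual setup)
library(targets)
library(tarchetypes)

# Packages loaded for every target when the pipeline runs
tar_option_set(packages = c("readr", "dplyr"))

list(
  # 1. Data: downloaded and parsed as one pipeline step
  tar_target(
    raw_data,
    read_csv2("http://opendata.gijon.es/descargar.php?id=590&tipo=EXCEL")
  ),
  # 2. Analysis: total uses per date and service
  tar_target(
    uses_by_service,
    raw_data %>% group_by(Fecha, Servicio) %>% summarise(Usos = sum(Usos))
  ),
  # 3. Communication: the report is just another target
  tar_render(report, "informe.Rmd")
)

Running targets::tar_make() then rebuilds only what changed: data, analysis and report alike.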

Benefits of using an integrated data analysis and communication methodology

- A single controlled and governed process. When we adopt DataOps, we can be sure that all stages of the data analysis are under control, governed and highly automated. This results in control and security throughout the data pipeline.

- Reproducible data science. When we communicate results and/or part of the data analysis process, it is common for other collaborators to build on our work to try to improve or modify the results; sometimes they will simply try to reproduce the same results. If the final communication has been part of the process in an integrated and automated way, our collaborators will have no problem reproducing those results themselves. If, on the contrary, the communication was done at the end of the process and decoupled from the analysis (both in time and in the tools used), there is a high probability that the reproduction attempt will fail. Software development processes, whether they involve data or not, are highly iterative: hundreds, if not thousands, of code changes are made before the desired results are obtained. If these iterative changes, however small, are decoupled from the final communication, the communicated result will almost certainly have missed changes that make its direct reproduction impossible.

- Continuous delivery of value. On many occasions I have experienced situations in which the preparation of results is left as the last phase of a project or data analysis process. Most of the effort goes into developing the data analysis and the algorithms (where applicable). This has a clear consequence: preparing the communication is the last task, and therefore the one that receives the least focus. All the team's energy has been spent in the previous phases; we are exhausted, and the documentation and communication are all that separate us from delivering the project. As a consequence, the project documentation will be insufficient and the communication poor. When we then present the results to our clients, we will desperately try to convince them that an excellent analysis has been done (and it will have been), but our only weapon is the communication we have prepared, and it is predictably much worse than the analysis behind it.

- Improvement in the quality of communication. When we integrate development and communication, we are monitoring at all times what our clients are going to consume. In this way, during the analysis process, we have the agility to modify the results we are producing (in the form of graphs, tables, etc.) at analysis time. On multiple occasions I have seen how, after closing the analysis phase and reviewing the results produced, we realize that something is not well understood or could be communicated better. These can be simple things, like the colors in a legend or the decimal digits in a table. However, if the analysis was carried out with very different tools and decoupled from the production of results (for example, a presentation), the very idea of rerunning the analysis project to modify small details will set off alarms in any analyst or data scientist. With the DataOps methodology and the right tools, we just have to rerun the data pipeline with the corresponding changes and everything is re-created automatically, as the sketch after this list illustrates.
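To illustrate that last point, here is a minimal sketch (the file name analisis_tarjeta.Rmd and the digits parameter are hypothetical, not this post's actual source) of how a parameterized R Markdown report can be re-rendered in a single call, so that changing a legend color or the decimal digits only requires editing the source and rerunning:

library(rmarkdown)

# Re-render the whole report: data loading, aggregations, tables and
# figures are re-executed, so small fixes propagate automatically.
render(
  "analisis_tarjeta.Rmd",          # hypothetical report source
  output_format = "html_document",
  params = list(digits = 0)        # e.g. decimal digits shown in tables
)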

Integrated communication tools

We have talked about the methodology and its benefits, but we should know that the tools used to implement the strategy correctly also play an important role. In fact, this very post has been produced entirely with a data pipeline that integrates, in a single process: (1) the writing of the post, (2) the creation of the website that publishes it, (3) the versioning of the code and (4) the data analysis itself, which in this case is only meant to illustrate that it is just one more part of the process.

Without going into too many technical details: within the same working environment for programming and document writing, RStudio, and using the R Markdown and Blogdown packages, we can create a complete website on which to publish our posts, in this case about data analysis. A detailed explanation of how to create the website that hosts these posts will be left for another time. For now we will focus on generating this post, in which we show an example of data analysis.
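As a minimal sketch of this setup (the theme and post title below are only illustrative choices, not necessarily the ones used for this site):

library(blogdown)

# One-time setup: create a Hugo-based site in the current directory
new_site(theme = "yihui/hugo-lithium")

# Start a new post whose .Rmd source mixes prose and executable R code
new_post("communicating-your-data-analysis", ext = ".Rmd")

# Preview locally; the site re-renders as the source files change
serve_site()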

To illustrate the process we are going to use this dataset, available at datos.gob.es. It collects the uses of the Gijón City Council citizen card during 2019, broken down by the different services available in the city.

As you can see, at this point we are already integrating the communication of the data analysis with the analysis itself. The first thing we will do is load the dataset and take a look at a preview of it.

library(tidyverse)  # readr, dplyr, ggplot2
library(plotly)     # interactive graphics via ggplotly()

file <- "http://opendata.gijon.es/descargar.php?id=590&tipo=EXCEL"
Citicard <- read_csv2(file, col_names = TRUE)  # semicolon-separated values
head(Citicard)
Date        Service         Installation           Uses
2019-01-01  Public toilets  Avda Carlos Marx        642
2019-01-01  Public toilets  Avda del Llano          594
2019-01-01  Public toilets  C/ Puerto el Pontón     139
2019-01-01  Public toilets  Cerro Santa Catalina    146
2019-01-01  Public toilets  Donato Argüelles       1095
2019-01-01  Public toilets  El Rinconín             604

Next we generate an analysis that is obvious and simple in any exploratory phase (EDA, Exploratory Data Analysis) of a new dataset: we aggregate the data by Date and Service, obtaining the total number of card uses per date and type of service.

# Total card uses per date and type of service
Citi_agg <- Citicard %>%
  group_by(Fecha, Servicio) %>%
  summarise(Usos = sum(Usos))

head(Citi_agg)
Date        Service            Uses
2019-01-01  Public toilets    17251
2019-01-01  Library loans     15369
2019-01-01  Transport       1201471
2019-02-01  Public toilets    18186
2019-02-01  Library loans     14716
2019-02-01  Transport       1158109

We plot the result and observe that the majority use of the citizen card is the payment of public transport. Since we have generated an interactive graph, we can use the Autoscale control and click on the legend to remove the transport series and analyze in detail the differences between the use of public toilets and library loans.

# Stacked bar chart of monthly card uses (in thousands) per service
Citi_fig <- ggplot(Citi_agg, aes(x = Fecha, y = Usos/1000, fill = Servicio)) +
  geom_bar(stat = "identity", colour = "white") +
  labs(x = "Fecha", y = "Uso Tarjeta (en miles)") +
  theme(
    axis.title.x = element_text(color = "black", size = 14, face = "bold"),
    axis.title.y = element_text(color = "black", size = 10, face = "bold")
  )

# Convert the static ggplot into an interactive plotly graph
ggplotly(Citi_fig)

When we momentarily discard the use of the card as a means of payment in public transport, we observe a greater use of the card for access to public toilets and, to a lesser extent, for loans in public libraries. Likewise, the zoom control lets us examine these differences for specific months in more comfort and detail.

If we ask ourselves how the total use of the citizen card is distributed across services throughout 2019, we can generate the following visualization and verify the expected result: use in public transport represents 97% of the total.

# Share of total 2019 uses per service, as a percentage
Tot_2019_uses <- sum(Citi_agg$Usos)

Citi_agg_tot <- Citicard %>%
  group_by(Servicio) %>%
  summarise(Usos = 100 * sum(Usos) / Tot_2019_uses) %>%
  arrange(desc(Usos))

# Render the result as a formatted table
knitr::kable(Citi_agg_tot, "pipe", digits = 0,
             col.names = c("Servicio Usado", "Uso Tarjeta en %"))
|Servicio Usado | Uso Tarjeta en %|
|:--------------|----------------:|
|Transport      |               97|
|Public toilets |                2|
|Library loans  |                1|
# Bar chart of the percentage share per service
Citi_fig2 <- ggplot(Citi_agg_tot, aes(x = Servicio, y = Usos, fill = Servicio)) +
  geom_bar(stat = "identity", colour = "white") +
  labs(x = "Servicio", y = "Uso en %") +
  theme(
    axis.title.x = element_text(color = "black", size = 14, face = "bold"),
    axis.title.y = element_text(color = "black", size = 14, face = "bold")
  )

ggplotly(Citi_fig2)

And that's it. In this post we have seen how an integrated communication strategy lets us combine the technical analysis with the generation of consumable results, in the form of tables and graphs ready for the end user. In the same process we integrate the calculations (aggregations, normalizations, etc.) with the production of quality results, in a format adapted to the non-specialist reader. With a non-integrated communication strategy, we would have post-processed the results of the technical analysis at a later point in time, and probably with a different tool. That would have made us less productive, while losing track of the steps followed to generate the final result.

Conclusions

Communication is a fundamental aspect of data analysis. Poor communication can ruin an excellent piece of data analysis work, no matter how sophisticated it is. Good communication requires an integrated communication strategy, which means adopting the DataOps philosophy to produce excellent, reproducible and automated data workflows. We hope you liked the topic of this post; we will return later with more content on DataOps and data communication. See you soon!


Content elaborated by Alejandro Alija, expert in Digital Transformation and Innovation.

Contents and points of view expressed in this publication are the exclusive responsibility of its author.