Play to be the best with data


News date: 31-01-2019

Publicly competing with your colleagues to solve a complex problem based on data is an irresistible motivation for some people. Almost as tempting as gaining relevance in a field of expertise as exciting and lucrative as data science.

Public competitions to solve complex problems, with public data as their raw material, are a consolidated trend in the world of data science. At their core are data-based problems, from predicting earthquakes to anticipating a stockout in a large distribution centre. New machine learning and deep learning methods, together with easy access to powerful computing technology, have made it possible for companies and organizations around the world to open their business problems to communities of data scientists, who compete with each other to solve the problem in the best possible way and win a cash prize.

In 2006, the now powerful and well-known video streaming platform Netflix launched its Netflix Prize. It was an open competition that sought to improve the collaborative filtering algorithm used to predict the ratings users gave to movies. Predictions had to be based solely on previous ratings, without any additional information about users or movies. The contest was open to anyone without close links to the company. On September 21, 2009, the grand prize of $1,000,000 was awarded to BellKor's Pragmatic Chaos, the team that outperformed Netflix's own algorithm at the time by passing the 10% improvement mark in rating prediction.
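To make the idea of collaborative filtering concrete, here is a minimal sketch in Python. It is not the Netflix algorithm or the winning solution: it is a simple user-based variant that predicts a rating as a similarity-weighted average of other users' ratings, using nothing but previous ratings. The toy data, the user names and the function names are all hypothetical. The `rmse` helper computes root mean squared error, the metric against which the 10% improvement was measured.

```python
import math

# Hypothetical toy data: user -> {movie: rating on a 1-5 scale}.
RATINGS = {
    "ana":   {"A": 5, "B": 3, "C": 4},
    "luis":  {"A": 4, "B": 2, "C": 5, "D": 4},
    "marta": {"A": 1, "B": 5, "D": 2},
}


def cosine_similarity(r1, r2):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[m] * r2[m] for m in common)
    norm1 = math.sqrt(sum(r1[m] ** 2 for m in common))
    norm2 = math.sqrt(sum(r2[m] ** 2 for m in common))
    return dot / (norm1 * norm2)


def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no other user with a non-zero similarity rated it.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[movie]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


def rmse(predicted, actual):
    """Root mean squared error between two equally long rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # "ana" has not rated movie "D"; estimate it from "luis" and "marta".
    print(predict(RATINGS, "ana", "D"))
```

In this toy example the prediction for "ana" lands closer to the rating given by "luis" than to the one given by "marta", because her past ratings resemble his more. A competitor would compare the `rmse` of such predictions on held-out ratings against a baseline.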

The Netflix Prize opened up a new option in the fertile field of data science: financially rewarding (with the appreciable amount of $1M) external teams able to improve the key to the company's business, its recommendation system. Netflix, like many other companies, was and is aware that the talent needed to improve its sophisticated algorithm was not within the organization.

Since then, many other similar competitions have been launched to solve all kinds of data-based problems. Platforms have even been created to manage this type of competition and to build a community of talented data scientists around the world's biggest data science challenges. Kaggle is perhaps one of the most popular platforms for this type of competition. At the time of writing, there are 9 active competitions on Kaggle with prizes worth $370,000. There are also another 9 competitions with no cash prize, which award knowledge and points (kudos) within the platform itself to encourage its continued use. For each competition, the platform manages the available data sets as well as the kernels, cloud work environments that allow all participants to run algorithms in an unattended and reproducible way. In addition, the platform establishes how each competition is assessed, as well as the ethical codes and licenses governing the use of hosted data.

In addition to their function as platforms for public data competitions, Kaggle and other similar platforms such as ImageNet or KDD perform a high-value function as open data repositories. Kaggle currently hosts more than 14,000 data sets in different formats, ready to be exploited and analysed by the most daring data scientists on the planet, and it documents those data sets extensively. The commonly accepted data formats are CSV, JSON, SQLite, compressed ZIP files and BigQuery (Google's SQL-based data warehouse for big data). The most common licenses for data use and redistribution are Creative Commons, GPL and Open Database.
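As an illustration of how approachable these formats are, the following sketch reads the same small table from a CSV file, a JSON file and a SQLite database using only the Python standard library. It does not use any Kaggle API: the file names, the table name and the sample rows are hypothetical, and the `demo` function writes its own temporary files so that the example is self-contained.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def load_csv(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_json(path):
    """Read a JSON file (here, an array of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_sqlite(path, table):
    """Read every row of `table` from a SQLite file into a list of dicts.

    The table name is interpolated into the query, so this sketch should
    only be used with trusted table names.
    """
    conn = sqlite3.connect(path)
    try:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(f'SELECT * FROM "{table}"')]
    finally:
        conn.close()


def demo():
    """Write hypothetical sample data in all three formats and read it back."""
    rows = [{"city": "Madrid", "value": "1"}, {"city": "Bilbao", "value": "2"}]
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)

        with open(base / "data.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["city", "value"])
            writer.writeheader()
            writer.writerows(rows)

        (base / "data.json").write_text(json.dumps(rows), encoding="utf-8")

        conn = sqlite3.connect(base / "data.sqlite")
        conn.execute("CREATE TABLE data (city TEXT, value TEXT)")
        conn.executemany("INSERT INTO data VALUES (:city, :value)", rows)
        conn.commit()
        conn.close()

        return (
            load_csv(base / "data.csv"),
            load_json(base / "data.json"),
            load_sqlite(base / "data.sqlite", "data"),
        )


if __name__ == "__main__":
    for result in demo():
        print(result)
```

All three loaders return the same list of records, which is why a data set published in any of these formats can feed the same analysis code. In practice many data scientists would reach for a library such as pandas instead; the standard library is used here only to keep the sketch free of dependencies.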

Platforms like Kaggle are fantastic. In my opinion, the greatest benefit of Kaggle is the opportunity to learn, especially for younger data scientists. On Kaggle you can learn a lot about data modelling, perhaps even much more than is normally needed in 90% of jobs related to machine learning. We must not forget, however, that in real life a data scientist needs much more than modelling knowledge. A good data scientist dedicates 10% of their time to modelling. The remaining 90% is divided among other technical skills in data management and "soft skills" such as communication, synthesis, working relationships and leadership.

Remember: if you want to learn a lot about machine learning on real problems, play on Kaggle, but do not forget to train and learn soft skills.

Content prepared by Alejandro Alija, expert in Digital Transformation and innovation.

Contents and points of view expressed in this publication are the exclusive responsibility of its author.