Kaggle | datos.gob.es

Kaggle and other alternative platforms for learning data science

Blog

The profession of data scientist is booming. According to the 2020 LinkedIn Emerging Jobs Report, demand for data science specialists grew 46.8% compared to the previous year, with particularly strong demand in sectors such as banking, telecommunications and research. The report also indicates that the skills companies look for include "Machine Learning, R, Apache Spark, Python, Data Science, Big Data, SQL, Data Mining, Statistics and Hadoop." Training in these tools and skills is therefore a notable competitive advantage in the workplace.

In this context, it is not surprising that university offerings in these subjects keep growing. At the same time, there are also alternatives that allow us to expand our knowledge in a playful way.

Gamification to learn data science

One of the best ways to learn new skills is through play. Solving challenges and real cases allows us to test our knowledge and exercise new skills in an entertaining and motivating way. This is known as gamification, a learning technique that applies elements of game design to non-game contexts. In this case we are talking about learning, but it can also be applied to marketing or even sectors like health and welfare, among others.

Gamification is well suited to acquiring data-related skills, as shown by competitions such as hackathons or application and idea contests, like our Challenge Contribute. In recent years, online platforms that offer users open competitions in the form of challenges have also grown.

Kaggle, a space for open competitions

Of all these platforms, the best known is Kaggle, which brings together more than 7 million registered users from around the world. It is a free platform that provides users with problems to solve using data science, predictive analytics or machine learning techniques, among others.

There are problems for beginners, like predicting survival on the Titanic (a binary classification problem) or house prices, which calls for advanced regression techniques. Some competitions come directly from companies that want to solve a challenge they have not been able to crack and choose to open it up to platform users, as Banco Santander did. Occasionally, there are large cash prizes for the user who finds the best solution. An example is the American football league, which seeks to predict impacts to players' helmets and awards $100,000 to whoever succeeds. There are also companies that create contests in which the winners earn an interview with their data science team, as Facebook did a few years ago. Kaggle is therefore a good way to improve the chances of finding a good job. Many recruiters keep an eye on the platform when looking for new talent, paying particular attention to competition winners.

In addition to competitions, Kaggle offers other functionalities:

A section to share datasets. There are currently more than 50,000 shared public data sets, which can be freely used to practice, solve competitions or train algorithms.
Free courses, which cover topics such as Python, introduction to machine learning, geospatial analysis or natural language processing. They are designed to quickly introduce the user to essential topics and guide them through the Kaggle platform. Once you have the basic knowledge, it is time to participate in competitions.
Notebooks, shared by Kaggle users. This is the code, along with tutorials, that competition participants have used to solve different problems. There are currently more than 500,000. To run and practice with them, Kaggle provides a computational environment designed to make data science work easy to reproduce.
A discussion forum, where users can resolve questions and share feedback. By signing up for Kaggle, you not only gain numerous resources, but also become part of a community of experts. Being present in the forum is key to expanding your knowledge, meeting other users, forming teams and learning from the experience of those who have mastered the subject in question.

Kaggle uses a progression system that classifies users according to their performance in each area. On the one hand, there are five performance tiers: Novice, Contributor, Expert, Master and Grandmaster. On the other, there are four categories of data science expertise on Kaggle: Competitions, Notebooks, Datasets and Discussion, which reflect the user's participation in each area. Progress through the performance tiers is tracked independently within each category, so the same user can be a Master in Competitions but a Novice in Discussion.

The success of Kaggle is so great that in 2017 it was acquired by Google.

If you are thinking of participating in a competition, you can find some tips in this post, video and presentation.

Other platforms similar to Kaggle

In addition to Kaggle, we also find other similar platforms on the web that host competitions and challenges related to data.

DrivenData. It organizes online challenges, which usually last between 2 and 3 months, some of them with cash prizes. One example is a competition to build machine learning algorithms capable of mapping floods using Sentinel-1 satellite images. It also has a data lab that offers companies services for building data-related solutions.
Devpost. It offers a repository of hackathons that users can sign up for, most of them online. It includes competitions run by companies such as Amazon or Microsoft. Some competitions offer up to $5 million in prizes.
Innocentive. It collects challenges from various organizations, some of them also with large prizes. Although it has technical competitions, it also includes theoretical or strategic challenges in which only a theoretical proposal is required.
CrowdAnalytix. With more than 25,000 users, CrowdAnalytix is a community where data experts collaborate and compete to customize and optimize algorithms. An example is this competition, in which crop evolution had to be predicted using public satellite images.

Platforms for learning data science through gamification: Kaggle, DrivenData, Devpost, Innocentive, CrowdAnalytix

A good profile on Kaggle, or on any of the other platforms we have seen, will help you gain experience and build a good portfolio of work. It will also make you more attractive to recruiters, increasing your chances of landing a good job. Good performance on Kaggle demonstrates problem-solving and teamwork skills, which are among the qualities needed to become a good data scientist.

Content prepared by the datos.gob.es team.

16/09/2021

Play to be the best with data

Publicly competing with your colleagues to solve a complex problem based on data is an irresistible motivation for some people. Almost as tempting as gaining relevance in a field of expertise as exciting and lucrative as data science.

Public competitions to solve complex problems, in which public data are the raw material, are a consolidated trend in the world of data science. At their core are data-based problems, from predicting earthquakes to anticipating a stockout in a large distribution centre. New machine learning and deep learning methods, together with easy access to powerful computing technology, have made it possible for companies and organizations around the world to open their business problems to communities of data scientists, who compete with each other to solve the problem in the best possible way in order to win a cash reward.

Ten years ago, the now powerful and well-known video streaming platform Netflix published its Netflix Prize. It was an open competition that sought to improve the collaborative filtering algorithm used to predict the ratings users would give to movies. The algorithm relies only on previous ratings, without any additional information about users or movies. The contest was open to anyone without close links to the company. On September 21, 2009, the grand prize of $1,000,000 was awarded to BellKor's Pragmatic Chaos, which outperformed the algorithm Netflix used at the time in predicting ratings, passing the 10% improvement mark.
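To make the idea of predicting ratings "based on the previous ratings without any additional information" more concrete, here is a minimal sketch of user-based collaborative filtering in Python. It is only an illustration under simple assumptions: the users, movies and ratings are invented, and the approach (a similarity-weighted average over other users) is a textbook baseline, not the algorithm Netflix used nor the one submitted by the winning team.

```python
from math import sqrt

# Toy ratings (hypothetical data): user -> {movie: rating on a 1-5 scale}
RATINGS = {
    "ana":   {"A": 5, "B": 4, "C": 1},
    "ben":   {"A": 4, "B": 5, "C": 2, "D": 5},
    "carla": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine_similarity(r1, r2):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[m] * r2[m] for m in common)
    norm1 = sqrt(sum(r1[m] ** 2 for m in common))
    norm2 = sqrt(sum(r2[m] ** 2 for m in common))
    return dot / (norm1 * norm2)


def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[movie]
        total_weight += weight
    if total_weight == 0:
        return None  # no comparable user has rated this movie
    return weighted_sum / total_weight


# ana has not rated movie D; estimate it from ben and carla's ratings.
prediction = predict_rating(RATINGS, "ana", "D")
print(f"Predicted rating of ana for movie D: {prediction:.2f}")
```

Because ana's past ratings resemble ben's more than carla's, the estimate is pulled towards ben's rating for movie D. Real systems work on millions of ratings and use far more elaborate models.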

The Netflix Prize opened up a new path in the fertile field of data science by financially rewarding (with the considerable sum of $1M) external teams that were able to improve the core of its business, its recommendation system. Netflix, like many other companies, was and is aware that the talent needed to improve its sophisticated algorithm was not within the organization.

Since then, many similar competitions have been launched to solve all kinds of data-based problems. Platforms have even been created to manage this type of competition and to build a community of talented data scientists around the world's biggest data science challenges. The Kaggle website is perhaps one of the most popular platforms for this type of competition. At the time of writing, there are 9 active competitions worth $370,000 on Kaggle. There are another 9 competitions with no cash prize, which award knowledge and points (kudos) within the platform itself to encourage continued use. For each competition, the platform manages the available data sets as well as the kernels, cloud work environments that allow all participants to run algorithms in an unattended and reproducible way. In addition, the platform sets the rules for evaluating the competition, as well as ethical codes and licenses for the use of hosted data.

In addition to serving as a platform for public data competitions, Kaggle and other similar platforms such as ImageNet or KDD play a valuable role as open data repositories. Kaggle currently lists more than 14,000 data sets in different formats, ready to be explored and analysed by the most daring data scientists on the planet. Kaggle documents the data sets available on the platform extensively. The commonly accepted data formats are CSV, JSON, SQLite, compressed ZIP files and BigQuery (Google's SQL-based data warehouse for big data). The most common licenses for data use and redistribution are Creative Commons, GPL and Open Database.
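As an illustration of working with the file formats listed above, the following sketch loads the same small table from CSV, JSON, SQLite and a ZIP archive using only the Python standard library. The file names, the `weather` table and the sample rows are hypothetical and are created on the fly so that the example runs on its own; BigQuery is left out because it requires a Google Cloud account and client library.

```python
import csv
import io
import json
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def load_csv(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_json(path):
    """Read a JSON file and return the decoded Python object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_sqlite(path, table):
    """Read every row of `table`; only pass trusted table names here."""
    conn = sqlite3.connect(path)
    try:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(f"SELECT * FROM {table}")]
    finally:
        conn.close()


def load_csv_from_zip(zip_path, member):
    """Read a CSV member straight from a ZIP archive without extracting it."""
    with zipfile.ZipFile(zip_path) as zf, zf.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


# Build tiny sample files (hypothetical data) and load each one back.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    rows = [{"city": "Madrid", "temp_c": "31"}, {"city": "Oslo", "temp_c": "18"}]

    csv_path = tmp / "sample.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["city", "temp_c"])
        writer.writeheader()
        writer.writerows(rows)

    json_path = tmp / "sample.json"
    json_path.write_text(json.dumps(rows), encoding="utf-8")

    db_path = tmp / "sample.sqlite"
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE weather (city TEXT, temp_c TEXT)")
    conn.executemany("INSERT INTO weather VALUES (:city, :temp_c)", rows)
    conn.commit()
    conn.close()

    zip_path = tmp / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(csv_path, arcname="sample.csv")

    loaded = {
        "csv": load_csv(csv_path),
        "json": load_json(json_path),
        "sqlite": load_sqlite(db_path, "weather"),
        "zip": load_csv_from_zip(zip_path, "sample.csv"),
    }

for fmt, data in loaded.items():
    print(fmt, data)
```

In practice many data scientists would use a library such as pandas for this, but the standard library is enough to inspect a downloaded data set before deciding how to analyse it.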

Platforms like Kaggle are fantastic. In my opinion, the greatest benefit of Kaggle is the opportunity to learn, especially for younger data scientists. On Kaggle you can learn a lot about data modelling, perhaps even much more than is normally needed in 90% of jobs related to machine learning. We must not forget, however, that in real life a data scientist needs much more than modelling knowledge. A good data scientist dedicates 10% of their time to modelling. The remaining 90% is divided among other technical skills in data management and "soft skills", such as communication, synthesis, relationships with colleagues and leadership.

Remember, if you want to learn a lot about machine learning on real problems, play on Kaggle, but do not forget to train and develop your soft skills.

Content prepared by Alejandro Alija, expert in Digital Transformation and innovation.

Contents and points of view expressed in this publication are the exclusive responsibility of its author.

26/11/2019

Examples of uncommon open data repositories

Beyond the data from public administrations, libraries, museums and cultural foundations, the interest in open data knows no borders. We invite you to discover it in this post.

Normally, the concept of open data is associated with repositories managed by public administrations, foundations and cultural organizations such as libraries and museums. But open data include much more and, if we search thoroughly, we can find real jewels waiting to be explored. Often they are repositories on specific topics, very useful for professionals who work in that field. Others are general repositories with unusual data sets.

Let's see some examples.

Open data and science

To illustrate topic-specific data repositories, let's focus on two examples from the scientific field:

1) Open data portal of the European Space Agency. On this website we can find a large number of images and data from the different space missions of the European Space Agency (ESA). For example, it hosts most satellite images from the Copernicus program, the most ambitious Earth observation program to date, which provides accurate, timely and easily accessible information to improve environmental management, understand and mitigate the effects of climate change and ensure civil security.

Example of an open image under a CC BY-SA 3.0 IGO license from the ESA open data repository, in particular from the Copernicus Earth observation program.

ESA makes available not only images and videos from satellites, but also a large amount of observation data that can be processed to generate our own images or analyses. As an example, the data generated by the Gaia mission, the most ambitious mission to draw a three-dimensional map of our Galaxy, is available for direct download at this link. Browsing the links under the main repository, we can access files in .csv format of several tens of MB, ready for analysis.

2) CERN open data portal. CERN is the European laboratory for nuclear research. It is the place where the World Wide Web was born, it concentrates a good part of the best scientific talent in Europe, and it generates several dozen petabytes of data per year. CERN also has its own website dedicated to open data. The CERN open data site is very user-friendly for non-experts and proposes different ways of approaching the data stored there: we can explore the site by following the Learn, Visualize or Analyze path. This website is a treasure trove of data, but basic (or not so basic) notions of particle physics are needed to exploit its full potential.

In addition to the core site, CERN makes a GitHub site available to (advanced) users, so that developers who want to work with open data have a more suitable environment for exploiting the data programmatically. GitHub sites and other open source repositories foster the development of collaborative user communities around open data.

Very, very diverse data

But in addition to these specific repositories, there are also general-purpose repositories with unusual data sets. We have already discussed the Kaggle website on previous occasions. Kaggle is an open web platform aimed at data scientists in which challenges are posed (some of them with high cash prizes). This time we turn to Kaggle only to explore its extensive catalog of data (mostly published under one of the Creative Commons license variants).

To cite some varied examples, among the first entries of its catalog we find data sets on wave heights on the Australian coast or, for example, a data set listing 10,000 women's shoes with their prices, published under a CC BY-NC-SA 4.0 license. One of the most popular and widely used data sets today could not be missing from this list. Every quarter, Stack Overflow, the largest online community for programmers, publishes an extract of its database with the posts, votes, tags and comments that have passed through its platform. Analyzing this data set (published under CC BY-SA 3.0), which is more than 100 GB in size, is probably the most accurate way to measure market trends in the popularity and use of programming languages.
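As a rough sketch of how such a trend analysis could start, the Python snippet below computes, for each year, the share of posts tagged with a given programming language. The sample posts, their field names (`year`, `tags`) and the language list are assumptions made for illustration; the real extract is far larger and its schema is not described here.

```python
from collections import Counter

# Hypothetical sample of posts; field names are assumed for illustration.
POSTS = [
    {"year": 2018, "tags": ["python", "pandas"]},
    {"year": 2018, "tags": ["javascript"]},
    {"year": 2019, "tags": ["python", "numpy"]},
    {"year": 2019, "tags": ["python"]},
    {"year": 2019, "tags": ["javascript", "react"]},
]
LANGUAGES = {"python", "javascript"}


def language_share_by_year(posts, languages):
    """Return {year: {language: fraction of that year's posts tagged with it}}."""
    totals = Counter()  # number of posts per year
    tagged = {}         # per-year counts of posts carrying each language tag
    for post in posts:
        year = post["year"]
        totals[year] += 1
        counts = tagged.setdefault(year, Counter())
        for tag in set(post["tags"]) & languages:
            counts[tag] += 1
    return {
        year: {lang: tagged[year][lang] / totals[year] for lang in sorted(languages)}
        for year in sorted(totals)
    }


shares = language_share_by_year(POSTS, LANGUAGES)
for year, by_language in shares.items():
    print(year, by_language)
```

Comparing these shares from one year to the next gives a first, simple indicator of whether a language is gaining or losing popularity among the community's users.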

In short, in addition to the existing data sets on mobility, environment, location of basic services in cities or cultural collections, there are open data repositories, much more specific, for those intrepid users who dare to investigate in search of less common data. Of course, the future of open data has no borders.

Content prepared by Alejandro Alija, expert in Digital Transformation and innovation.

Contents and points of view expressed in this publication are the exclusive responsibility of its author.

25/11/2019