Open source automated machine learning tools

News date: 02-01-2025


The increasing complexity of machine learning models and the need to optimise their performance have been driving the development of AutoML (Automated Machine Learning) for years. This discipline seeks to automate key tasks in the model development lifecycle, such as algorithm selection, data processing and hyperparameter optimisation.

AutoML allows users to develop models more easily and quickly. It lowers the barrier to entry to the discipline for professionals with less programming experience and speeds up processes for those with more. A user with in-depth programming knowledge can also benefit: automated machine learning can apply the necessary technical settings for them, such as defining variables, and makes interpreting the results more agile.

In this post, we will discuss the keys to these automation processes and compile a series of free and/or freemium open source tools that can help you to deepen your knowledge of AutoML.

Learn how to create your own machine learning model

As indicated above, thanks to automation, the training and evaluation process of models based on AutoML tools is faster than in a usual machine learning (ML) process, although the stages for model creation are similar.

In general, the key components of AutoML are:

  1. Data processing: automates tasks such as data cleaning, transformation and feature selection.
  2. Model selection: examines a variety of machine learning algorithms and chooses the most appropriate one for the specific task.
  3. Hyperparameter optimisation: automatically adjusts model hyperparameters to improve performance.
  4. Model evaluation: provides performance metrics and validates models using techniques such as cross-validation.
  5. Implementation and maintenance: facilitates the deployment of models in production and, in some cases, their updating.
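To make components 2 to 4 more concrete, here is a minimal sketch in plain Python (standard library only). It is not the API of any of the tools discussed in this post: the synthetic data, the candidate models and names such as `auto_select` and `cross_val_mse` are assumptions made for illustration. It evaluates several candidates with k-fold cross-validation and keeps the one with the lowest error, which is the basic loop that AutoML tools automate on a much larger scale. Data processing and deployment are left out.

```python
import random
import statistics


def make_data(n=60, seed=0):
    """Noisy samples of y = 2x + 1 (synthetic data, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def mean_model(train_x, train_y):
    """Baseline candidate: always predict the mean of the training targets."""
    mu = statistics.fmean(train_y)
    return lambda x: mu


def knn_model(k):
    """k-nearest-neighbours regressor; k is the hyperparameter being tuned."""
    def fit(train_x, train_y):
        pairs = list(zip(train_x, train_y))

        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return statistics.fmean([y for _, y in nearest])

        return predict
    return fit


def cross_val_mse(fit, xs, ys, folds=5):
    """Model evaluation: mean squared error over k cross-validation folds."""
    errors = []
    for f in range(folds):
        test_idx = set(range(f, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        errors.extend((predict(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return statistics.fmean(errors)


def auto_select(xs, ys):
    """Model selection and hyperparameter search over a tiny candidate space."""
    candidates = {"mean": mean_model}
    for k in (1, 3, 5, 9):
        candidates[f"knn(k={k})"] = knn_model(k)
    scores = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


xs, ys = make_data()
best, scores = auto_select(xs, ys)
print("selected:", best, "cross-validated MSE:", round(scores[best], 3))
```

Real tools search far larger spaces and use smarter strategies than exhaustive comparison, but the structure (candidates, a validation scheme, a metric and a selection rule) is the same.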

Together, these elements offer a number of advantages, as shown in the figure below.

[Figure: Main advantages of AutoML. Accessibility: allows people with no machine learning experience to create useful models. Efficiency: saves time by automating tasks that would otherwise be manual and tedious. Quality improvement: can find optimal solutions that a human might overlook.]

Figure 1. Source: Own elaboration

Examples of AutoML tools

Although AutoML can be very useful, it has limitations worth highlighting: the risk of overfitting (when the model fits the training data too closely and does not generalise well to new data), the loss of control over the modelling process and the reduced interpretability of certain results.
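The overfitting risk mentioned above can be illustrated with a short sketch in plain Python. The data is synthetic and the 1-nearest-neighbour model is chosen only because it memorises its training set; neither comes from any of the tools discussed here. The point is that a model can look perfect on the data it was trained on and still make sizeable errors on data it has not seen, which is why a held-out check matters even when the search is automated.

```python
import random

rng = random.Random(42)

# Synthetic data: y = x plus strong Gaussian noise (illustrative only).
points = []
for _ in range(80):
    x = rng.uniform(0, 10)
    points.append((x, x + rng.gauss(0, 2)))

train, held_out = points[:60], points[60:]


def one_nn(train_set):
    """1-nearest-neighbour regressor: it memorises every training point."""
    def predict(x):
        return min(train_set, key=lambda p: abs(p[0] - x))[1]
    return predict


def mse(predict, data):
    """Mean squared error of a predictor on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


model = one_nn(train)
train_mse = mse(model, train)
held_out_mse = mse(model, held_out)
print(f"train MSE = {train_mse:.3f}, held-out MSE = {held_out_mse:.3f}")
```

The training error is zero because each training point is its own nearest neighbour, while the held-out error reflects the noise the model has memorised.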

However, as AutoML continues to gain ground in the field of machine learning, a number of tools have emerged to facilitate its implementation and use. Below, we explore some of the most prominent open source AutoML tools:

H2O.ai, versatile and scalable, ideal for enterprises

H2O.ai is an AutoML platform that includes deep learning and machine learning models, such as XGBoost (a gradient boosting library designed for efficiency), as well as a graphical user interface. This tool is used in large-scale projects and allows a high level of customisation. H2O.ai includes options for classification, regression and time series models, and stands out for its ability to handle large volumes of data.

Although H2O makes machine learning accessible to non-experts, some knowledge and experience in data science is necessary to get the most out of the tool. In addition, it makes the data analyst's work easier, as it covers a large number of modelling tasks that would normally require many lines of code. H2O offers a freemium model and also has an open source community version.

TPOT, based on genetic algorithms, good option for experimentation

TPOT (Tree-based Pipeline Optimization Tool) is a free and open source Python machine learning tool that optimises processes through genetic programming.

This solution looks for the best combination of data pre-processing and machine learning models for a specific dataset. To do so, it uses genetic algorithms that allow it to explore and optimise different pipelines, data transformations and models. This is a more experimental option that may be less intuitive, but offers innovative solutions.

In addition, TPOT is built on top of the popular scikit-learn library, so models generated by TPOT can be used and adjusted with the same techniques that would be used in scikit-learn.
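The idea of searching over pipelines with a genetic algorithm can be sketched in plain Python. This is a deliberately simplified illustration, not TPOT's implementation or API: TPOT evolves tree-structured scikit-learn pipelines, whereas this sketch evolves a fixed set of three hypothetical "genes" (feature subset, scaler and the `k` of a k-nearest-neighbours classifier) on synthetic data. All names, such as `GENES`, `fitness` and `evolve`, are assumptions made for the example.

```python
import random

rng = random.Random(7)


def make_rows(n):
    """Synthetic data: feature 0 decides the class, feature 1 is large-scale noise."""
    rows = []
    for _ in range(n):
        x0, x1 = rng.random(), rng.uniform(0, 1000)
        rows.append(((x0, x1), int(x0 > 0.5)))
    return rows


TRAIN, VALID = make_rows(80), make_rows(40)

# Search space: a candidate pipeline is one choice per gene.
GENES = {
    "features": [(0,), (1,), (0, 1)],
    "scaler": ["none", "minmax"],
    "k": [1, 3, 5, 9, 15],
}
# Min-max statistics are taken from the training rows only.
LO = [min(x[i] for x, _ in TRAIN) for i in range(2)]
HI = [max(x[i] for x, _ in TRAIN) for i in range(2)]


def transform(rows, pipeline):
    """Pre-processing: select features, then optionally min-max scale them."""
    out = []
    for x, y in rows:
        vec = []
        for i in pipeline["features"]:
            v = x[i]
            if pipeline["scaler"] == "minmax":
                v = (v - LO[i]) / (HI[i] - LO[i])
            vec.append(v)
        out.append((vec, y))
    return out


def fitness(pipeline):
    """Validation accuracy of a k-NN classifier built from the pipeline."""
    train, valid = transform(TRAIN, pipeline), transform(VALID, pipeline)
    hits = 0
    for vec, y in valid:
        nearest = sorted(
            train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], vec))
        )[: pipeline["k"]]
        votes = sum(label for _, label in nearest)
        hits += int((2 * votes > len(nearest)) == bool(y))
    return hits / len(valid)


def crossover(a, b):
    """Child takes each gene from one of its two parents at random."""
    return {g: rng.choice((a[g], b[g])) for g in GENES}


def mutate(pipeline, rate=0.2):
    """Each gene is re-drawn at random with probability `rate`."""
    return {
        g: rng.choice(GENES[g]) if rng.random() < rate else v
        for g, v in pipeline.items()
    }


def evolve(pop_size=8, generations=6):
    population = [{g: rng.choice(o) for g, o in GENES.items()} for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]  # elitist selection
        children = [
            mutate(crossover(*rng.sample(survivors, 2)))
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best_pipeline, best_per_generation = evolve()
print(best_pipeline, best_per_generation)
```

Because the best candidates always survive into the next generation, the best score never decreases. The cost is that every candidate must be trained and evaluated, which is why this family of methods becomes expensive on large datasets.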

Auto-sklearn, accessible to scikit-learn users and efficient on structured problems

Like TPOT, Auto-sklearn is based on scikit-learn and serves to automate algorithm selection and hyperparameter optimisation in machine learning models in Python.

In addition to being a free and open source option, it includes techniques for handling missing data, a very useful feature when working with real-world datasets. Auto-sklearn also offers a simple and easy-to-use API, allowing users to start the modelling process with a few lines of code.
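As an illustration of what handling missing data can involve, the following sketch shows mean imputation, one common technique, in plain Python. It is not Auto-sklearn's code or API, and the function name `impute_column_means` is an assumption made for the example. A tool of this kind would apply a step like this automatically as part of the pipeline.

```python
import math
import statistics


def impute_column_means(rows):
    """Replace missing values (None or NaN) with the mean of their column.

    Assumes every column has at least one observed value; otherwise
    statistics.fmean raises StatisticsError.
    """
    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if not missing(r[c])]
        means.append(statistics.fmean(observed))
    return [[means[c] if missing(v) else v for c, v in enumerate(r)] for r in rows]


data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, float("nan")],
]
filled = impute_column_means(data)
print(filled)
```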

BigML, integration via REST APIs and flexible pricing models

BigML is a consumable, programmable and scalable machine learning platform that, like the other tools mentioned above, facilitates the resolution and automation of classification, regression, time series forecasting, cluster analysis, anomaly detection, association discovery and topic modelling tasks. It features an intuitive interface and a focus on visualisation that makes it easy to create and manage ML models, even for users with little programming knowledge.

In addition, BigML has a REST API that enables integration with various applications and languages, and is scalable to handle large volumes of data. It also offers a flexible pricing model based on usage, and has an active community that regularly updates the learning resources available.

The following table shows a comparison between these tools:

 

| | H2O.ai | TPOT | Auto-sklearn | BigML |
|---|---|---|---|---|
| Use | For large-scale projects. | To experiment with genetic algorithms and optimise pipelines. | For users of scikit-learn who want to automate the model selection process and for structured tasks. | To create and deploy ML models in an accessible and simple way. |
| Configuration difficulty | Simple, with advanced options. | Medium difficulty. A more technical option because of its genetic algorithms. | Medium difficulty. It requires technical configuration, but is easy for scikit-learn users. | Simple. Intuitive interface with customisation options. |
| Ease of use | Easy to use with the most common programming languages. It has a graphical interface and APIs for R and Python. | Easy to use, but requires knowledge of Python. | Easy to use, but requires prior knowledge. Easy option for scikit-learn users. | Easy to use, focused on visualisation, no programming skills required. |
| Scalability | Scalable to large volumes of data. | Focus on small and medium-sized datasets. Less efficient on large datasets. | Effective on small and medium-sized datasets. | Scalable for different sizes of datasets. |
| Interoperability | Compatible with several libraries and languages, such as Java, Scala, Python and R. | Based on Python. | Based on Python integrating scikit-learn. | Compatible with REST APIs and various languages. |
| Community | Extensive and active sharing of reference documentation. | Less extensive, but growing. | It is supported by the scikit-learn community. | Active community and support available. |
| Disadvantages | Although versatile, its advanced customisation could be challenging for beginners without technical experience. | May be less efficient on large datasets due to the intensive nature of genetic algorithms. | Its performance is optimised for structured tasks (structured data), which may limit its use for other types of problems. | Its advanced customisation could be challenging for beginners without technical experience. |

Figure 2. Comparative table of AutoML tools. Source: Own elaboration

Each tool has its own value proposition, and the choice will depend on the specific needs and environment in which the user works.

These are just some examples of free and open source tools that you can explore to get into AutoML. We invite you to share your experience with these or other tools in the comments section below.

If you are looking for tools to help you with data processing, at datos.gob.es we offer you the report "Tools for data processing and visualisation", as well as the following monographic articles: