Free GPU! The spring of machine learning for civilian players is here

2019/09/19 04:45:10 technology

In recent years, data science has shown two obvious trends:

1. More and more data analysis and model training are completed through cloud computing

2. The machine learning workflow itself (the "pipeline") is being optimized by algorithms

Cloud computing with Google Colab

Almost everyone has their own computer these days, but laptops and desktops are generally suited only to everyday work. The data sets used in machine learning keep growing, and so do the demands on computing power. For ordinary users, cloud computing is often the most practical way to do machine learning.

In this article, we will run a simple data workflow in a Jupyter Notebook in the cloud, using Google Colab: a free online Jupyter Notebook service from Google (currently Python kernel only). With it, wherever you are and whether or not you have your own computer with you, you can run your machine learning models as long as you have an Internet connection. Most of the data science libraries you need are already installed on Google's virtual machines, so you can start working without configuring an environment, and you can use an NVIDIA Tesla K80 GPU for free!
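To confirm that a notebook really has a GPU attached, one option is to ask the NVIDIA command-line tool from Python. This is a minimal sketch, assuming the runtime's hardware accelerator has been set to GPU in Colab's runtime settings; the helper name gpu_names is ours, not part of Colab.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not reach a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(gpu_names() or "No GPU visible; check the notebook's runtime settings.")
```

An empty list on Colab usually just means the runtime type is still set to CPU.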

To use Colab, you need only an Internet connection (readers in mainland China will need a VPN or a similar tool) and a Google account. Without further ado, let's explore Google Colab. The notebook that accompanies this article is here: https://colab.research.google.com/drive/1CIVn-GoOyY3H2_Bv8z09mkNRokQ9jlJ-

Paste the link into the Chrome browser and open it; you will need to log in to your Google account. Once the notebook opens, click File > Save a copy in Drive. You can then open the copy in your own Drive to edit and run it.

Google Colab has significantly lowered the barrier to using cloud computing, and it is easy to imagine similar online resources becoming more and more accessible. For readers who have used Jupyter Notebooks on a local computer, this is a good opportunity to move to the cloud.

Use TPOT for machine learning automation

Next, I will introduce another powerful tool: machine learning automation (AutoML for short). It uses algorithms to design and optimize a machine learning workflow for a specific problem. In this article, the machine learning pipeline consists of the following steps:

1. Feature preprocessing: impute missing values, scale features, construct new features

2. Feature selection: dimensionality reduction

3. Model selection: evaluate multiple models

4. Tuning: find the best model hyperparameter settings
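To make the four steps concrete, here is a rough sketch of what wiring them together by hand might look like in scikit-learn. The step names, the choice of SelectKBest and GradientBoostingRegressor, and the small grid are illustrative assumptions, not the setup used for the project in this article.

```python
import math

# Hypothetical tuning grid: two settings for each of two hyperparameters.
PARAM_GRID = {
    "select__k": [10, 20],
    "model__n_estimators": [100, 200],
}


def count_candidates(param_grid):
    """Number of pipelines an exhaustive grid search would have to fit."""
    return math.prod(len(options) for options in param_grid.values())


def build_manual_search(param_grid=PARAM_GRID):
    """Chain the four steps by hand and wrap them in a grid search."""
    # Imported here so the sketch loads even where scikit-learn is missing.
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        # Step 1: feature preprocessing
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        # Step 2: feature selection
        ("select", SelectKBest(score_func=f_regression)),
        # Step 3: one candidate model (others would need their own search)
        ("model", GradientBoostingRegressor()),
    ])
    # Step 4: tuning over the grid with cross-validation
    return GridSearchCV(pipeline, param_grid,
                        scoring="neg_mean_absolute_error", cv=5)


print(count_candidates(PARAM_GRID))  # 2 * 2 = 4 candidate pipelines
```

Even this tiny grid yields four candidates for a single model; every extra option or model multiplies the count.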

Combining these four steps yields an almost unlimited variety of pipelines, and the best one differs from problem to problem. Designing a machine learning pipeline is a time-consuming and laborious process, so we generally cannot try every pipeline, which means you never know whether the one you designed is optimal. This is where machine learning automation comes in: it can evaluate thousands of possible pipelines and automatically find the optimal (or near-optimal) solution.

Machine learning is only one part of data science, and machine learning automation is not a replacement for data scientists. On the contrary, it frees data scientists to focus on more valuable work, such as data collection and model interpretation.

There are already many tools for machine learning automation, including H2O, auto-sklearn, Google Cloud AutoML, and TPOT (Tree-based Pipeline Optimization Tool), which I will focus on here. TPOT searches for the best machine learning pipeline using genetic algorithms.

The main benefit of genetic algorithms for building machine learning models is thorough exploration. Even with unlimited time, a person cannot try every combination of preprocessing, models, and hyperparameters; personal knowledge and imagination are limited. A genetic algorithm has no initial bias toward any machine learning pipeline (humans may form biases from their own experience), and every pipeline is evaluated objectively. In addition, the fitness function steers the search so that the most promising regions of the pipeline space are explored more thoroughly than poorly performing ones, which is another major advantage of genetic algorithms.
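The idea can be illustrated with a toy genetic algorithm. This is only a sketch: the search space and scores below are made up, and TPOT's real search works on tree-shaped pipelines with cross-validated scores, not on a lookup table like this one.

```python
import random

# Hypothetical search space: one option per pipeline step.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "k_best", "pca"],
    "model": ["lasso", "random_forest", "gradient_boosting"],
    "depth": [2, 4, 6, 8],
}

# Made-up scores standing in for a cross-validated metric.
TOY_SCORES = {
    "none": 0, "standard": 2, "minmax": 1, "k_best": 2, "pca": 1,
    "lasso": 1, "random_forest": 3, "gradient_boosting": 4,
}


def fitness(pipeline):
    """Higher is better; the best possible toy score is 8."""
    score = sum(TOY_SCORES[pipeline[s]] for s in ("scaler", "selector", "model"))
    return score - abs(pipeline["depth"] - 6) / 2


def random_pipeline(rng):
    return {step: rng.choice(options) for step, options in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Child takes each step from one of its two parents."""
    return {step: rng.choice((a[step], b[step])) for step in SEARCH_SPACE}


def mutate(pipeline, rng, rate=0.2):
    """Occasionally swap a step for a random alternative."""
    return {
        step: rng.choice(SEARCH_SPACE[step]) if rng.random() < rate else value
        for step, value in pipeline.items()
    }


def evolve(population_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: population_size // 2]  # fitter half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), best_per_generation


best, history = evolve()
print(best, history)
```

Because the fitter half always survives, the best score never gets worse from one generation to the next, and children are bred only from pipelines that already score well.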

Combine the two: machine learning automation on the cloud

The implementation is actually very simple. With that background in place, we can use TPOT on Google Colab to automate machine learning.

Let's try a supervised regression problem: using New York City building energy data, we want to predict each building's Energy Star score. The author previously did the feature engineering, dimensionality reduction, model selection, and hyperparameter tuning by hand and ended up with a Gradient Boosting regression model whose mean absolute error on the test set was 9.06. How does the model produced by automation compare?

The data set contains several dozen continuous numeric variables (such as building energy use and floor area) and two one-hot encoded categorical variables (region name and building type), for a total of 82 features.

First, we need to check whether TPOT is installed in the Google Colab environment. Most data science packages are preinstalled. To add a missing package, install it with pip from a notebook cell, remembering to put "!" in front of the command: !pip install tpot.
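The check itself can be done from Python before deciding whether to install anything. A small sketch; the helper name is_installed is ours.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("tpot"):
    # In a Colab cell the install is a shell command, hence the "!":
    #   !pip install tpot
    print("TPOT is missing; run `!pip install tpot` in a notebook cell.")
```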

After loading the data, we would normally impute missing values and normalize the features. The good news is that, in addition to the feature engineering, model selection, and hyperparameter tuning described above, TPOT also imputes missing values and scales features automatically. So the only remaining step is to create the TPOT optimizer.

With the default parameters, the TPOT optimizer maintains a population of 100 pipelines and evolves it for 100 generations, scoring about 10,000 pipelines in total. With ten-fold cross-validation, that means 100,000 training runs! Even on Google's computing resources there is a time limit. To stay within the Colab server's limit (Google allows at most 12 hours of continuous running time), we will cap TPOT's running time at 8 hours. Although TPOT is typically run for several days, a few hours of optimization can still produce a good model.
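The arithmetic behind those figures is easy to check. A short sketch, using the population size, generation count, and fold count quoted above; the function name is ours.

```python
def training_runs(population_size, generations, cv_folds):
    """Return (pipelines scored, individual model fits) for a full search."""
    pipelines = population_size * generations
    return pipelines, pipelines * cv_folds


print(training_runs(100, 100, 10))  # (10000, 100000)
print(training_runs(100, 100, 5))   # (10000, 50000)
```

Halving the number of folds halves the number of model fits, which matters when the search is capped by wall-clock time.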

We will set the following parameters:

● scoring = neg_mean_absolute_error: the regression performance metric

● max_time_mins = 480: limit the running time to 8 hours

● n_jobs = -1: use all available cores on the machine

● verbosity = 2: display a limited amount of information during training

● cv = 5: use 5-fold cross-validation (default value is 10)

There are, of course, other parameters that can be set, but the defaults suit most situations, so we leave them unchanged here.

The TPOT optimizer follows the same interface as Scikit-Learn models, so we can train it with the .fit method.
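Put together, creating and training the optimizer might look like the sketch below. It assumes the classic TPOT 0.x API (TPOTRegressor with these keyword names); newer TPOT releases may name some arguments differently. The names training_features and training_targets are placeholders for your own training arrays.

```python
# The settings listed above, as keyword arguments for TPOTRegressor.
TPOT_SETTINGS = {
    "scoring": "neg_mean_absolute_error",  # regression metric
    "max_time_mins": 480,                  # stop after 8 hours
    "n_jobs": -1,                          # use every available core
    "verbosity": 2,                        # limited progress output
    "cv": 5,                               # 5-fold cross-validation
}


def train_optimizer(training_features, training_targets, settings=TPOT_SETTINGS):
    """Create a TPOT regressor with the given settings and fit it."""
    # Imported here because TPOT is an optional install (classic 0.x API assumed).
    from tpot import TPOTRegressor

    optimizer = TPOTRegressor(**settings)
    optimizer.fit(training_features, training_targets)
    return optimizer
```

Calling train_optimizer(training_features, training_targets) returns the fitted optimizer, whose best pipeline is then available as fitted_pipeline_.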

During training, TPOT reports its progress, including the best internal cross-validation score found so far.

Because of the time limit, the optimizer completed only 15 generations, which means we scored 1,500 different pipelines, far more than we could try by hand!

Once training is complete, we can inspect the best pipeline through tpot.fitted_pipeline_. We can also export it as a Python script with the optimizer's export method.

Since we are working in a Google Colab notebook, downloading the exported pipeline from the server to a local machine requires Google Colab's file management library (google.colab.files).
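The export and download steps could be sketched as follows. The two helper functions are ours; they assume that the fitted optimizer exposes an export(path) method, as the classic TPOT API does, and that the code runs inside Colab, where google.colab.files is available.

```python
import os


def export_pipeline(optimizer, path="tpot_exported_pipeline.py"):
    """Write the best pipeline found by a fitted TPOT optimizer to a script."""
    optimizer.export(path)
    return os.path.abspath(path)


def download_from_colab(path):
    """Send a file from the Colab virtual machine to the browser."""
    from google.colab import files  # only importable inside Colab

    files.download(path)
```

In a notebook this would be used as download_from_colab(export_pipeline(tpot)), where tpot is the fitted optimizer.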

Opening the tpot_exported_pipeline.py file shows the complete pipeline.

(The download link for this file is at the end of the article.)

We can see that the optimizer imputed the missing values for us and built a complete pipeline! The final model is a stacked model that combines two algorithms, LassoLarsCV and GradientBoostingRegressor. To be honest, I probably would not have come up with such a complex model on my own.

Now for the exciting part: let's see how the model performs on the test set. We can use the .score method to get the mean absolute error.
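For reference, the mean absolute error is simply the average size of the prediction errors. A plain-Python sketch with made-up scores:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [50, 75, 90]       # made-up Energy Star scores
predicted = [55, 70, 100]   # made-up model predictions
print(mean_absolute_error(actual, predicted))  # (5 + 5 + 10) / 3
```

One thing to watch: scikit-learn's neg_mean_absolute_error scorer reports the negated error so that higher is always better, so a score computed with it may come back with a minus sign.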

When I did this project by hand, it took several hours, and the resulting Gradient Boosting Regressor model had a mean absolute error of 9.06. Machine learning automation clearly improved the performance of the final model and also drastically reduced development time.

Summary

In this article, we briefly introduced cloud computing for machine learning and machine learning automation. With a Google account and an Internet connection, you can use Google Colab to develop, run, and share machine learning work. With TPOT, the best machine learning pipeline (including feature preprocessing, model selection, and hyperparameter tuning) can be found through automated training and evaluation. We also saw that machine learning automation will not replace data scientists; rather, it frees them to spend more time on more valuable work.

Although it is a young project, TPOT is already fairly mature and very easy to use. Why not try it on a machine learning problem of your own (there are many good projects on Kaggle)? Running an automated machine learning project in a Google Colab notebook feels futuristic, and the barrier to entry is remarkably low. In fact, I suddenly felt like running it on my phone~

It runs perfectly!

Download links for the files mentioned in the article:

https://colab.research.google.com/drive/1CIVn-GoOyY3H2_Bv8z09mkNRokQ9jlJ-

https://github.com/WillKoehrsen/machine-learning-project-walkthrough/blob/master/auto_ml/tpot_exported_pipeline.py

Text-William Koehrsen

Translated-Allen

Original-https://towardsdatascience.com/automated-machine-learning-on-the-cloud-in-python-47cf568859f
