Free GPU! Spring has arrived for everyday machine learning practitioners

2019/09/19 04:45:10

In recent years, data science has shown two obvious trends:

1. More and more data analysis and model training are completed through cloud computing

2. The machine learning workflow (the "pipeline") is itself being optimized by algorithms

Cloud computing with Google Colab

Almost everyone now has a personal computer, but laptops and desktops are generally suited only to everyday work. As machine learning data sets keep growing, so do the demands on computing power, and for ordinary users the cloud is often the most practical place to train models.

In this article, we will run a simple data workflow in a Jupyter Notebook on the cloud, using Google Colab, Google's free online Jupyter Notebook service (currently Python kernel only). With Colab you can run your own machine learning models from anywhere, whether or not you have your own computer with you, as long as you have an Internet connection. Most of the data science libraries you need are preinstalled on Google's virtual machines, so there is no environment to configure, and you can use an NVIDIA Tesla K80 GPU for free!

To use Colab, you only need an Internet connection (readers in mainland China will need a VPN or similar tool) and a Google account. Without further ado, let's explore Google Colab. The notebook for this article is here: https://colab.research.google.com/drive/1CIVn-GoOyY3H2_Bv8z09mkNRokQ9jlJ-

Paste the link into the Chrome browser and open it; you will need to sign in to your Google account. Once the notebook opens, click File > Save a copy in Drive. You can then open the copy in your own Drive to edit and run it.

It can be said that Google Colab has significantly lowered the threshold for using cloud computing. It is not difficult to imagine that similar online resources will become more and more accessible in the future. For students who have used Jupyter Notebooks on their local computers, this is a good opportunity to transition to cloud computing.

Use TPOT for machine learning automation

Next, I will introduce another powerful tool: automated machine learning (AutoML for short), which uses algorithms to design and optimize the machine learning workflow for a specific problem. In this article, the machine learning pipeline includes the following steps:

1. Feature preprocessing: fill in missing values, scale features, build new features

2. Feature selection: dimensionality reduction

3. Model selection: evaluate multiple models

4. Tuning: find the best model hyperparameter settings

Combining these four steps yields an almost unlimited variety of pipelines, and the best one differs from problem to problem. Designing a machine learning pipeline is time-consuming and laborious, so we generally cannot try every option, which means you never know whether the pipeline you designed is optimal. This is where automated machine learning comes in: it can evaluate thousands of candidate pipelines and automatically find an optimal (or near-optimal) one.

Machine learning is only part of data science, and machine learning automation does not mean that it can replace data scientists. On the contrary, machine learning automation can free the hands of data scientists, allowing them to focus on more valuable parts, such as data collection, model interpretation, etc.

There are already many tools for automated machine learning, including H2O, auto-sklearn, Google Cloud AutoML, and TPOT (Tree-based Pipeline Optimization Tool), which I will focus on here. TPOT searches for the best machine learning pipeline using the principles of genetic algorithms.

The main benefit of using a genetic algorithm to build machine learning models is thorough exploration. Even with unlimited time, a person could not try every combination of preprocessing steps, models, and hyperparameters, because individual knowledge and imagination are limited. A genetic algorithm starts with no bias toward any particular pipeline (humans tend to develop preferences from experience), and every pipeline is evaluated objectively. In addition, the fitness function steers the search so that the most promising regions of the pipeline space are explored more thoroughly than the poorly performing ones, which is another major advantage.
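To make the idea concrete, here is a toy genetic search over pipeline choices. It is only an illustration and not TPOT's actual algorithm: the search space, the option names, and the scores in TOY_SCORES are all made up, standing in for real cross-validated performance.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "pca", "select_k_best"],
    "model": ["lasso", "random_forest", "gradient_boosting"],
}

# Made-up scores standing in for cross-validated performance.
TOY_SCORES = {
    "none": 0.0, "standard": 0.2, "minmax": 0.1,
    "pca": 0.1, "select_k_best": 0.3,
    "lasso": 0.2, "random_forest": 0.4, "gradient_boosting": 0.5,
}


def random_pipeline(rng):
    """Pick one option at random for every stage."""
    return {stage: rng.choice(options) for stage, options in SEARCH_SPACE.items()}


def fitness(pipeline):
    """Toy fitness: the sum of the made-up scores of the chosen options."""
    return sum(TOY_SCORES[choice] for choice in pipeline.values())


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each stage from one of the two parents."""
    return {stage: rng.choice([parent_a[stage], parent_b[stage]]) for stage in SEARCH_SPACE}


def mutate(pipeline, rng):
    """Re-draw the option of one randomly chosen stage."""
    child = dict(pipeline)
    stage = rng.choice(list(SEARCH_SPACE))
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def evolve(population_size=10, generations=5, seed=0):
    """Return the fittest pipeline found after the given number of generations."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the fitter half, so promising regions produce more offspring.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the fitter half always survives, the best score never gets worse from one generation to the next, and most new candidates are bred from pipelines that already perform well.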

Combine the two: machine learning automation on the cloud

The implementation is actually very simple! With that background in place, we can use TPOT on Google Colab to automate machine learning.

Let's tackle a supervised regression problem: using energy data for New York City buildings, we want to predict each building's Energy Star rating. The author previously carried out feature engineering, dimensionality reduction, model selection, and hyperparameter tuning by hand, and ended up with a Gradient Boosting regression model whose mean absolute error on the test set was 9.06. How will the model produced by automation compare?

The data set contains dozens of continuous numerical variables (such as building energy usage and building area) and two one-hot encoded categorical variables (region name and building type), for a total of 82 features.

First, we need to determine whether TPOT has been installed in the Google Colab environment. Generally speaking, most of the data science packages are already installed. If you want to add a new package, you can use the following command (remember to add "!" in front):
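As a sketch, assuming the package is installed from PyPI under the name tpot, the check and the install command might look like this (the helper name is_installed is made up for illustration):

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("tpot"):
    # In a Colab cell, the leading "!" sends the line to the shell.
    print("TPOT is missing; run this in a notebook cell:  !pip install tpot")
```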

After reading the data, we usually fill in missing values and normalize features. The good news is that in addition to the feature engineering, model selection, and parameter adjustment described above, TPOT will also automatically fill in missing values and perform feature scaling! So, we only need to create the TPOT optimizer in the next step.

With the default parameters, the TPOT optimizer keeps a population of 100 pipelines and evolves it for 100 generations, scoring about 10,000 pipelines in total. With ten-fold cross-validation, that means 100,000 training runs! Even on Google's computing resources there is a time limit: Colab allows at most 12 hours of continuous running time. To stay within it, we set TPOT's maximum running time to 8 hours. TPOT is usually meant to run for days, but a few hours of optimization can still produce a good model.
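The arithmetic behind those figures can be written out as follows. This is a simplified count that follows the article's own reasoning (population size times generations); the function name is made up, and TPOT's exact bookkeeping may differ slightly.

```python
def count_training_runs(population_size, generations, cv_folds):
    """Return (pipelines scored, model fits) under the article's simplified count."""
    pipelines = population_size * generations
    return pipelines, pipelines * cv_folds


pipelines, training_runs = count_training_runs(
    population_size=100, generations=100, cv_folds=10
)
print(pipelines, training_runs)
```

With the defaults described above this gives 10,000 pipelines and 100,000 training runs, which is why a time limit is needed.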

We will set the following parameters:

● scoring = neg_mean_absolute_error: the regression performance metric

● max_time_mins = 480: limit the running time to 8 hours

● n_jobs = -1: use all available cores on the machine

● verbosity = 2: display a limited amount of information during training

● cv = 5: use 5-fold cross-validation (default value is 10)

There are other parameters as well, but their default values suit most situations, so we leave them unchanged here.

The TPOT optimizer follows the same interface as Scikit-Learn models, so we can train it with the .fit method.
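Put together, the settings listed above and the .fit call might look like the sketch below. It assumes the classic TPOT 0.x API, where the time limit is spelled max_time_mins; the names train_optimizer, training_features, and training_targets are placeholders, not taken from the article.

```python
# Settings from the parameter list (classic TPOT 0.x parameter names assumed).
TPOT_SETTINGS = {
    "scoring": "neg_mean_absolute_error",  # regression metric
    "max_time_mins": 8 * 60,               # stop after 8 hours
    "n_jobs": -1,                          # use every available core
    "verbosity": 2,                        # limited progress output
    "cv": 5,                               # 5-fold cross-validation
}


def train_optimizer(training_features, training_targets, settings=TPOT_SETTINGS):
    """Create a TPOTRegressor and fit it, Scikit-Learn style."""
    # Imported here so the settings can be inspected without TPOT installed.
    from tpot import TPOTRegressor

    optimizer = TPOTRegressor(**settings)
    optimizer.fit(training_features, training_targets)
    return optimizer
```

Calling train_optimizer(training_features, training_targets) starts the search, so expect it to run until the time limit is reached.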

During the training process, we obtained the following information:

Because of the time limit, the population evolved for only 15 generations, which means we scored 1,500 different pipelines, far more than we could have tried by hand!

Once the model is trained, we can check the optimal pipeline through tpot.fitted_pipeline_. We can also save the model to a Python script:
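A minimal sketch of both steps, assuming a fitted optimizer from the classic TPOT 0.x API (which provides an export method); the function name save_best_pipeline is made up for illustration:

```python
def save_best_pipeline(optimizer, path="tpot_exported_pipeline.py"):
    """Print the best pipeline found and write it out as a Python script."""
    print(optimizer.fitted_pipeline_)  # the fitted Scikit-Learn pipeline
    optimizer.export(path)             # writes standalone Python code to path
    return path
```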

Since we are working in a Google Colab notebook, downloading this pipeline from the server to a local machine requires Google Colab's file management library:
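A sketch of that download step, using the files module from the google.colab package. The wrapper function and its fallback behavior are my own additions so the code also runs outside Colab:

```python
def download_from_colab(path):
    """Trigger a browser download of path when running inside Google Colab."""
    try:
        from google.colab import files  # only available on Colab
    except ImportError:
        # Not running on Colab: nothing to download.
        return False
    files.download(path)
    return True


download_from_colab("tpot_exported_pipeline.py")
```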

We can open the tpot_exported_pipeline.py file to view the complete pipeline:

(The download address of this file is at the end of the article)

We can see that the optimizer imputed the missing values for us and built a complete pipeline! The final model is a stacked model that combines two algorithms, LassoLarsCV and GradientBoostingRegressor. To be honest, if I had built the model by hand, I probably would not have arrived at one this complex.
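To give a feel for the structure, here is a rough sketch of a similar stacked pipeline. It is not the exported file: it uses Scikit-Learn's StackingRegressor as a stand-in for TPOT's own stacking wrapper, and the median imputation strategy, the default hyperparameters, and the function name are all assumptions.

```python
def build_stacked_pipeline():
    """Impute missing values, then stack LassoLarsCV under gradient boosting."""
    # Imported lazily so the definition loads even without Scikit-Learn.
    from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LassoLarsCV
    from sklearn.pipeline import make_pipeline

    stacked_model = StackingRegressor(
        estimators=[("lasso_lars", LassoLarsCV())],
        final_estimator=GradientBoostingRegressor(random_state=42),
        # Pass the original features through alongside the first model's predictions.
        passthrough=True,
    )
    return make_pipeline(SimpleImputer(strategy="median"), stacked_model)
```

The key idea is that the second model sees both the original features and the first model's predictions, so it can learn to correct the first model's errors.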

Now for the exciting part: let's see how the model performs on the test set. We can use .score to get the mean absolute error:
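A sketch of this evaluation step. The first function shows what the metric computes; the second wraps .score and takes the absolute value, on the assumption that a scorer named neg_mean_absolute_error may report the error with a negative sign. The names holdout_mae, testing_features, and testing_targets are placeholders.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def holdout_mae(optimizer, testing_features, testing_targets):
    """Score a fitted optimizer on held-out data and report a positive MAE."""
    return abs(optimizer.score(testing_features, testing_targets))
```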

When I did this project by hand, it took several hours, and the resulting Gradient Boosting Regressor had a mean absolute error of 9.06. Automated machine learning really does improve the final model's performance significantly while drastically reducing development time.

Summary

In this article, we briefly introduced cloud computing for machine learning and automated machine learning. With just a Google account and an Internet connection, you can use Google Colab to develop, run, and share machine learning notebooks. With TPOT, an optimized machine learning pipeline (including feature preprocessing, model selection, and hyperparameter tuning) can be found through automated training and evaluation. We also saw that automated machine learning will not replace data scientists; rather, it frees up their time for more valuable work.

Although TPOT is a young project, it is already fairly mature and very easy to use. Why not try this approach on a machine learning problem of your own (there are plenty of good projects on Kaggle)? Running an automated machine learning project in a Google Colab notebook feels futuristic, and the barrier to entry is so low that I suddenly wanted to try running it on my phone.

It ran perfectly!

Download links for the files mentioned in the article:

https://colab.research.google.com/drive/1CIVn-GoOyY3H2_Bv8z09mkNRokQ9jlJ-

https://github.com/WillKoehrsen/machine-learning-project-walkthrough/blob/master/auto_ml/tpot_exported_pipeline.py

Text: William Koehrsen
Translated by: Allen
Original: https://towardsdatascience.com/automated-machine-learning-on-the-cloud-in-python-47cf568859f
