Leifeng.com note: the original author of this article is Yang Xi, and it was first published on the author's personal blog.
I wrote an article, "One framework solves almost all machine learning problems", before, but it had no concrete examples or code. Today I came across a good one, the Exploratory Tutorial - Titanic kernel on Kaggle, and will walk through it here. You can click the link to go straight to the source code.
In this article, we can learn the complete process of using machine learning to solve an analysis problem. It covers the general workflow for solving such problems, common methods of descriptive statistics in Pandas, common methods of data cleaning, how to heuristically think up other influencing factors from the given ordinary variables, the general procedure for building a model in sklearn, and how to use the very popular ensemble learning.
Now, on to the topic:
In Titanic: Machine Learning from Disaster, the task is to determine which passengers are more likely to survive based on the provided data, such as age, sex, and other factors, so this is a classification problem.
Solving a machine learning problem generally involves the following steps:
Data Exploration
Data Cleaning
Feature Engineering
Model Building
Ensemble Learning
Prediction
1. Data Exploration
This part first imports the commonly used NumPy, Pandas, and Matplotlib packages, then loads the training and test sets:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
After that, you can use the following commands to get a first look at the structure of the data tables:
train.tail()
test.head()
train.describe()
Next, you can observe the distribution of each variable:
The distribution of each variable in the test set and the training set is similar.
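The kernel shows this with plots; here is a minimal sketch of one such comparison, using Age as an example (the plotting details are my own, not the kernel's):

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
train.Age.hist(ax=axes[0], bins=20)   # Age distribution in the training set
axes[0].set_title('Age (train)')
test.Age.hist(ax=axes[1], bins=20)    # Age distribution in the test set
axes[1].set_title('Age (test)')
plt.show()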

Then look at the impact of each variable on the classification tag:
For example, the impact of sex: visualization shows that among the surviving passengers there are more women than men. The same kind of comparison can be made for Pclass.
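A minimal sketch of such a comparison with pandas (the kernel's own plotting code may differ):

# Survival rate grouped by Sex and by Pclass, drawn as bar charts
train.groupby('Sex').Survived.mean().plot(kind='bar', title='Survival rate by Sex')
plt.show()
train.groupby('Pclass').Survived.mean().plot(kind='bar', title='Survival rate by Pclass')
plt.show()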

2. Data Cleaning
In this section, you can count the missing values of each variable:
train.isnull().sum()
# test.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2

For missing Age values, you can find a group of similar records and fill in that group's average (for a continuous variable) or its most frequent value (for a discrete one). For example, the two missing Embarked values both belong to passengers with Pclass 1 and Fare 80, and for such passengers the most common situation is Embarked = C, so they are filled with C.
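A hedged sketch of this imputation, assuming full is the concatenation of train and test that the kernel works with later (the kernel's actual grouping variables may differ):

# Embarked: the two missing ports are filled with the most common value, 'C'
full.Embarked.fillna('C', inplace=True)

# Age: fill each missing value with the median age of similar passengers,
# here grouped by Pclass and Sex as one plausible choice of "similar group"
full['Age'] = full.groupby(['Pclass', 'Sex']).Age.transform(
    lambda s: s.fillna(s.median()))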
3. Feature Engineering
I wrote an article about how to do feature engineering before, but it only introduced some concepts. This example is more instructive: see how a given variable can be expanded into more influential features, and how new factors can be thought up from the actual situation and converted into numerical form.
Below are the original variables in the data; let's see what factors can be derived from them.
pclass    Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name      Name
sex       Sex
age       Age
sibsp     Number of Siblings/Spouses Aboard
parch     Number of Parents/Children Aboard
ticket    Ticket Number
fare      Passenger Fare
cabin     Cabin
embarked  Port of Embarkation

For example, titles such as ['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'] can be captured in Name to reflect a passenger's professional status. The letters [a-zA-Z] in Cabin may reflect social status, while the digits [0-9] in Cabin may represent the cabin's location on the ship. SibSp and Parch can be used to calculate how many members of the same family are among the passengers.
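The code below works with a title variable; a plausible sketch of how it could be extracted from Name (assuming the kernel's combined full DataFrame; the kernel defines it along these lines) is:

# Names look like "Braund, Mr. Owen Harris"; the title sits between
# the comma and the period
title = full.Name.map(lambda name: name.split(',')[1].split('.')[0].strip())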
import re

# Group the rarer professional titles under one label
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'
# The letter part of Cabin gives the deck
deck = full[~full.Cabin.isnull()].Cabin.map(
    lambda x: re.compile("([a-zA-Z]+)").search(x).group())
# The digit part of Cabin gives the cabin number
checker = re.compile("([0-9]+)")
# Family size: parents/children + siblings/spouses + the passenger
full['Group_num'] = full.Parch + full.SibSp + 1
In this step, the categorical data also needs to be converted into dummy variables, i.e., into vector form: a sample belonging to the first category gets a 1 in the first position and 0 everywhere else, as illustrated below.
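A tiny illustration of this encoding, with hypothetical port values:

pd.get_dummies(pd.Series(['C', 'Q', 'S', 'C']))
#    C  Q  S
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0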
The continuous data is normalized, i.e., values with a large range of variation are scaled into 0~1 or -1~1. Then the irrelevant variables are dropped.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # the kernel defines a scaler along these lines
train = pd.get_dummies(train, columns=['Embarked', 'Pclass', 'Title', 'Group_size'])
full['NorFare'] = pd.Series(
    scaler.fit_transform(full.Fare.values.reshape(-1, 1)).reshape(-1),
    index=full.index)
full.drop(labels=['PassengerId', 'Name', 'Cabin', 'Survived', 'Ticket', 'Fare'],
          axis=1, inplace=True)
4. Model Building
First, divide the data into a training set and a test set using train_test_split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Because many models will be used later, you can write the cross-validation and fitting parts into a function: each classifier is passed into the function for training, and the trained model is returned at the end.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

scoring = make_scorer(accuracy_score, greater_is_better=True)

def get_model(estimator, parameters, X_train, y_train, scoring):
    model = GridSearchCV(estimator, param_grid=parameters, scoring=scoring)
    model.fit(X_train, y_train)
    return model.best_estimator_
Taking KNN as an example, let's look at the process of building, training, and using a model for prediction:
Import the classifier model from sklearn and define a KNN;
Define a suitable parameter grid, parameters;
Then use get_model to train the KNN model;
Next, use the trained model to predict on the test set and obtain the accuracy_score;
Then draw the learning_curve.
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(weights='uniform')
parameters = {'n_neighbors': [3, 4, 5], 'p': [1, 2]}
clf_knn = get_model(KNN, parameters, X_train, y_train, scoring)
print(accuracy_score(y_test, clf_knn.predict(X_test)))
plot_learning_curve(clf_knn, 'KNN', X, y, cv=4);
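plot_learning_curve is a helper defined in the kernel; a minimal sketch built on sklearn's learning_curve utility (my own simplification, not the kernel's exact code) could be:

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=None):
    # Compute training and cross-validation scores at increasing training sizes
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv)
    plt.figure()
    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, np.mean(test_scores, axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')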

Using the method above, try multiple models and print out their accuracy_score:
KNN, 0.816143497758
Random Forest, 0.829596412556
Random Forest, 0.834080717489
Logistic Regression, 0.811659192825
SVC, 0.838565022422
XGBoost, 0.820627802691
5. Ensemble
Next, use VotingClassifier to combine the previously trained classifiers, fit it on the training data, then compute the accuracy_score and draw the learning_curve.
from sklearn.ensemble import VotingClassifier

clf_vc = VotingClassifier(estimators=[('xgb1', clf_xgb1), ('lg1', clf_lg1),
                                      ('svc', clf_svc), ('rfc1', clf_rfc1),
                                      ('rfc2', clf_rfc2), ('knn', clf_knn)],
                          voting='hard', weights=[4, 1, 1, 1, 1, 2])
clf_vc = clf_vc.fit(X_train, y_train)
print(accuracy_score(y_test, clf_vc.predict(X_test)))
plot_learning_curve(clf_vc, 'Ensemble', X, y, cv=4);

ensemble, 0.825112107623
6. Prediction
Finally, use the trained model to predict on the given test-set file, put the data into the specified format, and save it as a CSV file for submission.
def submission(model, fname, X):
    # PassengerId is assumed to have been saved from the test file earlier
    ans = pd.DataFrame(columns=['PassengerId', 'Survived'])
    ans.PassengerId = PassengerId
    ans.Survived = pd.Series(model.predict(X), index=ans.index)
    ans.to_csv(fname, index=False)
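A hypothetical call, assuming X_submission holds the processed features of the submission file's rows (the name is mine, not the kernel's):

submission(clf_vc, 'titanic_submission.csv', X_submission)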