
This article explores Standardization, the most commonly used form of feature scaling.

Preface

Feature scaling, commonly referred to as "feature normalization" or "standardization", is an important technique in data preprocessing. Sometimes it even determines whether an algorithm works at all and how well it works. When it comes to the necessity of feature scaling, the two most commonly cited examples are probably:

  • The units (scales) of different features may differ: height versus weight, Celsius versus Fahrenheit, house area versus number of rooms. One feature may range over [1000, 10000] while another ranges over [−0.1, 0.2]. In distance-related calculations, such unit differences change the result: large-scale features become decisive while small-scale features may be ignored. To eliminate the impact of unit and scale differences and treat every feature dimension equally, the features need to be normalized.
  • With the original features, because of the scale differences, the contour plot of the loss function may be elliptical. The gradient is perpendicular to the contour lines, so gradient descent follows a zigzag route instead of pointing toward the local minimum. After a zero-mean, unit-variance transformation of the features, the contour plot becomes closer to circular, the gradient descent direction oscillates less, and convergence is faster, as shown in the figure below (picture from Andrew Ng).


Feature Scaling from Andrew Ng

Standardization, the most commonly used form of feature scaling, looks like a "no-brainer". This article explores the whys behind it:

  • What are the commonly used feature scaling methods?
  • Which feature scaling method should be used in which situation? Is there a guiding principle?
  • Do all machine learning algorithms require feature scaling? Are there exceptions?
  • Are the contour plots of loss functions always ellipses or concentric circles? Can the effect of feature scaling be explained simply with ellipses and circles?
  • If the contour plot of the loss function is complicated, is there another intuitive explanation of feature scaling?

Based on the information gathered, this article tries to answer the questions above. The author's ability is limited, however, so let's just go as far as we can (smile).

Commonly used feature scaling methods

Before asking why, first look at what it is.

Given a data set, let the feature vectors be x with dimension D and the number of samples be R. They can be arranged into a D×R matrix, with one column per sample and one row per feature dimension, as shown in the figure below (picture from Hung-yi Lee pdf-Gradient Descent):


feature matrix

Feature scaling methods can be divided into two categories: row-wise and column-wise. Row-wise operations act on each feature dimension, while column-wise operations act on each sample. The figure above is an example of row-wise feature standardization.

Specifically, the commonly used feature scaling methods are as follows (from wiki-Feature scaling):

  • Rescaling (min-max normalization, range scaling):

x' = a + (x − min(x)) (b − a) / (max(x) − min(x))

This linearly maps each feature dimension to a target range [a, b]: the minimum value maps to a and the maximum value maps to b. Common target ranges are [0, 1] and [−1, 1]. In particular, the formula for mapping to [0, 1] is:

x' = (x − min(x)) / (max(x) − min(x))

  • Mean normalization:

x' = (x − mean(x)) / (max(x) − min(x))

This maps the mean to 0 and normalizes the feature by the difference between the maximum and minimum values. A more common choice is to normalize by the standard deviation, as follows.

  • Standardization (Z-score Normalization):

x' = (x − mean(x)) / σ(x)

This gives each feature dimension zero mean and unit variance (zero-mean and unit-variance).

  • Scaling to unit length:

x' = x / ||x||

Each sample's feature vector is divided by its length, i.e., the sample feature vector is normalized to unit length. The length is usually measured by the L2 norm (Euclidean distance) and sometimes by the L1 norm; a comparison of different norms can be found in the paper "CVPR2005-Histograms of Oriented Gradients for Human Detection".

Of the four feature scaling methods above, the first three are row-wise operations and the last one is a column-wise operation.
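As a concrete reference, here is a minimal NumPy sketch of the four methods (my own illustration; the feature matrix is assumed to be laid out as above, one row per feature dimension and one column per sample):

```python
import numpy as np

# Feature matrix X: D rows (feature dimensions) x R columns (samples).
X = np.array([[1000., 2000., 3000., 10000.],   # a large-scale feature
              [-0.1,   0.0,   0.1,   0.2]])    # a small-scale feature

x_min = X.min(axis=1, keepdims=True)
x_max = X.max(axis=1, keepdims=True)
x_mean = X.mean(axis=1, keepdims=True)

# 1. Rescaling (min-max normalization) to [0, 1], row-wise (per feature).
X_minmax = (X - x_min) / (x_max - x_min)

# 2. Mean normalization: subtract the mean, divide by the max-min span.
X_meannorm = (X - x_mean) / (x_max - x_min)

# 3. Standardization (z-score): zero mean and unit variance per feature.
X_standard = (X - x_mean) / X.std(axis=1, keepdims=True)

# 4. Scaling to unit length: column-wise (per sample), divide by the L2 norm.
X_unit = X / np.linalg.norm(X, axis=0, keepdims=True)

print(X_standard.mean(axis=1), X_standard.std(axis=1))  # ~[0, 0], ~[1, 1]
print(np.linalg.norm(X_unit, axis=0))                   # all ones
```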

What is confusing is the terminology. "Standardization" is fairly unambiguous, but "normalization" on its own sometimes refers to min-max normalization, sometimes to Standardization, and sometimes to scaling to unit length.

Comparative analysis of the calculation methods

The first three feature scaling methods subtract a statistic and then divide by a statistic; the last one divides by the length of the vector itself.

  • Subtracting a statistic can be seen as choosing a value as the new origin, whether the minimum or the mean, and translating the whole data set to that new origin. If differing offsets between features harm the subsequent processing, this operation helps and can be regarded as a kind of offset-independent operation; if the original feature values have a special meaning, such as sparsity, this operation may destroy that sparsity.
  • Dividing by a statistic can be regarded as scaling the feature along its coordinate axis. It removes the effect of the feature's scale and can be regarded as a kind of scale-independent operation. The scaling can use the span between the maximum and minimum values, or the standard deviation (the average distance to the center). The former is sensitive to outliers; the effect of outliers on the latter depends on their number and on the size of the data set: the fewer the outliers and the larger the data set, the smaller the impact.
  • Dividing by the length normalizes the vector length, mapping all samples onto the unit sphere, and can be regarded as a kind of length-independent operation. For example, word-frequency features need to remove the influence of article length, some image features need to remove the influence of light intensity, and unit length also makes cosine distance or inner-product similarity easier to compute.

For more on preprocessing sparse data and data with outliers, see scikit learn-5.3. Preprocessing data.
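To make the outlier point concrete, here is a small sketch on made-up data comparing how a single outlier distorts min-max scaling versus standardization:

```python
import numpy as np

# A one-dimensional feature with one large outlier.
x = np.array([1., 2., 3., 4., 5., 100.])

# Min-max scaling: the outlier maps to 1 and squeezes everything else near 0.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: the outlier still shifts the mean and inflates the std,
# but the bulk of the data keeps a usable spread.
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # roughly [0, 0.01, 0.02, 0.03, 0.04, 1]
print(x_std)     # the non-outlier points remain clearly distinguishable
```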

Let's look at the effect of these operations geometrically (picture from CS231n-Neural Networks Part 2: Setting up the Data and the Loss). Zero-mean translates the data set to the origin, and unit-variance makes the spans of the feature dimensions comparable. The figure clearly shows that the two feature dimensions are linearly correlated, and the Standardization operation does not remove this correlation.


Standardization

The linear correlation can be removed (decorrelation) with PCA, that is, by introducing a rotation to find new coordinate axes and then scaling along those new axes by the "standard deviation", as shown in the figure below (picture from the linked source). The figure also illustrates the effect of unit-length scaling: mapping all samples onto the unit sphere.


Effect of the operations of standardization and length normalization
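The decorrelation-plus-scaling step (PCA whitening) can be sketched with a plain eigendecomposition of the covariance matrix. This is a minimal illustration on made-up correlated data, not the code behind the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: N samples x 2 features.
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.8 * x1 + 0.3 * rng.normal(size=500)])

# 1. Zero-mean: translate the data set to the origin.
Xc = X - X.mean(axis=0)

# 2. Rotate onto the eigenvectors of the covariance matrix (decorrelation).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_decorr = Xc @ eigvecs

# 3. Scale each new axis by its standard deviation (whitening).
X_white = X_decorr / np.sqrt(eigvals + 1e-12)

print(np.cov(X_white, rowvar=False))  # approximately the identity matrix
```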

With more feature dimensions, the comparison is as follows (picture from YouTube):


feature scaling comparison

In general, the purpose of normalization/standardization is to obtain some kind of "independence": offset independence, scale independence, length independence, and so on. When the physical and geometric meaning behind a normalization/standardization method matches the needs of the problem at hand, it helps solve the problem; otherwise it can hurt. So "when to choose which method" depends on the problem to be solved, i.e., it is problem-dependent.

Is feature scaling needed?

The table below comes from data school-Comparing supervised learning algorithms, which compares several supervised learning algorithms; the two columns on the right indicate whether feature scaling is required.


Comparing supervised learning algorithms

Let's analyze this in detail below.

When is feature scaling needed?

  • Algorithms that involve or imply distance calculations, such as K-means, KNN, PCA, and SVM, generally require feature scaling, because:

Zero-mean generally increases the differences in cosine distance or inner product between samples, giving stronger discriminative power. Imagine a data set concentrated in the far upper-right corner of the first quadrant being translated to the origin: the differences in cosine distance between samples are clearly amplified. In template matching, zero-mean can significantly improve how well the response values discriminate.
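A quick sketch of this effect on two toy points chosen far out in the first quadrant (my own example):

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two samples sitting far up in the first quadrant.
a = np.array([100., 102.])
b = np.array([102., 100.])

print(cosine_sim(a, b))                # ~0.9998: nearly indistinguishable

# Translate the (tiny) data set to its mean, i.e. zero-mean it.
mean = (a + b) / 2
print(cosine_sim(a - mean, b - mean))  # -1.0: the difference is amplified
```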

For Euclidean distance, increasing the scale of a feature is equivalent to increasing its weight in the distance calculation. If clear prior knowledge says a certain feature is important, appropriately increasing its weight can help; but without such a prior, or when the goal is to discover which features matter, feature scaling should be done first so that every feature dimension is treated equally.
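For example (made-up height/weight values), merely changing the unit of one feature can flip which neighbor is "closest" under Euclidean distance:

```python
import numpy as np

# Features: [height in cm, weight in kg] (toy values).
query = np.array([170., 70.])
p1    = np.array([171., 90.])   # similar height, very different weight
p2    = np.array([185., 71.])   # very different height, similar weight

dist = lambda u, v: np.linalg.norm(u - v)
print(dist(query, p1), dist(query, p2))   # ~20.0 vs ~15.0: p2 looks closer

# Measure height in millimetres instead: the height feature now dominates.
to_mm = np.array([10., 1.])
print(dist(query * to_mm, p1 * to_mm),    # ~22.4
      dist(query * to_mm, p2 * to_mm))    # ~150.0: now p1 looks closer
```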

Increasing the scale of a feature also increases its variance. PCA tends to focus on the axis directions with larger variance, so other features may be ignored. Therefore, performing Standardization before PCA often works better, as shown in the picture below (picture from scikit learn-Importance of Feature Scaling).

PCA and Standardization
  • When the loss function contains a regularization term, feature scaling is generally needed. For the linear model y = wx + b, any linear transformation (translation, scaling) of x can be "absorbed" by w and b, so in theory it does not affect the model's fitting ability. But if the loss function contains a regularization term such as λ||w||^2, where λ is a hyperparameter, the same penalty is applied to every component of w. For a feature x_i with a larger scale, the corresponding coefficient w_i will be smaller, so its share of the regularization term is smaller, which amounts to a lighter penalty on w_i. In other words, the loss function relatively ignores features whose scale has been inflated. This is unreasonable, so feature scaling is needed to make the loss function treat every feature dimension equally.
  • Gradient descent requires feature scaling. The parameter update rule of gradient descent is:

W ← W − η ∇E(W)

E(W) is the loss function. The convergence speed depends on the distance from the initial parameter position to the local minimum and on the size of the learning rate η. In the one-dimensional case, near the local minimum, the effect of different learning rates on gradient descent is shown in the figure below:


Gradient descent for different learning rates

In the multi-dimensional case, the picture can be decomposed into several copies of the figure above, one per dimension, each descending separately. The parameter W is a vector, but there is only one learning rate, i.e., all parameter dimensions share the same learning rate (algorithms that assign each dimension its own learning rate are not considered here). Convergence means reaching a minimum in every parameter dimension, where every partial derivative is 0, but the rate of descent differs across dimensions. For every dimension to converge, the learning rate should be no larger than the smallest appropriate step size over all dimensions at the current position. The following discusses the effect of feature scaling on gradient descent (a numerical sketch follows after this list).

    • Zero-centering works together with parameter initialization to shorten the distance between the initial parameter position and the local minimum, speeding up convergence. The final model parameters are unknown, so they are usually initialized randomly, e.g., sampled from a zero-mean uniform or Gaussian distribution. For linear models this places the initial decision boundary roughly near the origin, and since the bias is often initialized to 0, the boundary passes directly through the origin. Meanwhile, for convergence, the learning rate cannot be very large. Feature distributions differ across data sets; if a distribution is concentrated and far from the origin, for example in the far upper-right corner of the first quadrant, the decision boundary may need many steps to "climb" to where the data set is. Therefore, whatever the data set, first translating it to the origin, in cooperation with parameter initialization, ensures that the boundary starts out passing through the data set. In addition, outliers tend to lie on the periphery of the data set; compared with moving the boundary inward from the outside, moving it outward from the central area may be less affected by outliers.
    • For a linear model with mean squared error loss (LMS), the loss function is exactly quadratic, as shown in the figure below.


    • The rate of descent differs across directions (different second derivatives, i.e., different curvatures), and this is determined precisely by the covariance matrix of the input. Scaling changes the shape of the loss function, reducing the difference in curvature between directions. Decomposing the descent by dimension: for a given step size that is not small enough, some dimensions decrease a lot, some decrease a little, and some may even increase, so the overall loss may go up or down and the process is unstable. After scaling, the curvatures in different directions become closer, an appropriate learning rate is easier to choose, and the descent is relatively more stable.
    • There is also an interpretation in terms of the eigenvalues and condition number of the Hessian matrix. For details, see the Convergence of Gradient Descent section in the Lecun paper-Efficient BackProp, which gives a clear mathematical description and also explains the role of whitening: removing the linear correlation between features so that gradient descent in each dimension can be viewed independently.
    • The elliptical and circular contour plots at the beginning of the article only apply to linear models with mean squared error loss. For other loss functions or more complex models, such as deep neural networks, the error surface may be complex and cannot be described simply by ellipses and circles, so using them to explain the effect of feature scaling on gradient descent for all loss functions is an oversimplification. See Hinton video-3.2 The error surface for a linear neuron.
    • Even when the loss function is not the mean squared error, as long as the weight w multiplies the input feature x, the partial derivative of the loss with respect to w contains a factor of x, so the descent speed of w is affected by the scale of x. In theory, an adaptive per-parameter learning rate could absorb the effect of the scale of x, but in practice, for computational reasons, all parameters often share one learning rate. In that case, different scales of x can make the descent speed vary greatly across directions, the learning rate hard to choose, and the descent process unstable. Scaling evens out the descent speed across directions, making the process relatively more stable.
  • For traditional neural networks, feature scaling of the input is also very important. With saturating activation functions such as sigmoid, if the input range is very wide and the initialization is not adapted to it, the units can fall straight into the saturation zone and the gradient vanishes. The input therefore needs to be standardized or mapped to [0, 1] or [−1, 1], and a carefully designed parameter initialization scheme is used to control the range of activations. Since the introduction of Batch Normalization, which re-normalizes the feature distribution after every linear transformation, it might seem unnecessary to scale the network's input, but it is still customary to do so.
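As promised above, here is a minimal numerical sketch of how feature scale affects gradient descent. The setup (a two-feature linear regression with mean squared error and my own toy data) is an assumption for illustration, not a figure from the cited sources:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with very different feature scales.
N = 200
X = np.column_stack([rng.uniform(0, 1, N),       # feature 1: scale ~1
                     rng.uniform(0, 1000, N)])   # feature 2: scale ~1000
y = X @ np.array([3.0, 0.002]) + 0.01 * rng.normal(size=N)

def add_bias(A):
    # Prepend a column of ones so the model has an intercept.
    return np.column_stack([np.ones(len(A)), A])

def gd_mse(A, y, lr, steps=2000):
    """Plain batch gradient descent on the mean squared error; returns final MSE."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * A.T @ (A @ w - y)
    return np.mean((A @ w - y) ** 2)

# Raw features: the learning rate must be tiny (larger values diverge),
# so descent along the small-scale direction is painfully slow.
print(gd_mse(add_bias(X), y, lr=1e-6))

# Standardized features: a much larger learning rate is stable and
# the loss drops to the noise level within the same number of steps.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(gd_mse(add_bias(Xs), y, lr=0.1))
```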

When is feature scaling not needed?

Probabilistic models that do not involve distance calculations, such as Naive Bayes, do not require feature scaling.

Tree-based models that do not involve distance calculations, such as decision trees and random forests, do not require feature scaling either. Choosing a split at a tree node only requires deciding where to divide the current feature so that the classification improves; it only cares about the relative order within a feature and is unrelated to relative magnitudes across features.
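A hedged sanity check of this point using scikit-learn (a common setup of my own, not code from any of the cited sources):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the same tree on raw and on standardized features.
raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
std = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

# Splits are thresholds within a single feature, and standardization is a
# monotonic per-feature transform, so the accuracies should come out the same.
print(raw.score(X_te, y_te))
print(std.score(scaler.transform(X_te), y_te))
```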

Summary

This article was difficult to write. At first I thought the topic was simple and straightforward, but as the exploration deepened, more and more question marks appeared, breaking many things I had taken for granted. So while writing I kept adding material, and I wanted to explain as many points as possible intuitively without copying too many formulas; my understanding is not deep enough, though, which is why the narrative has become this long. I hope future articles can be more focused and concise.

Reference

1. wiki-Feature scaling
2. wiki-Backpropagation
3. Hung-yi Lee pdf-Gradient Descent: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Gradient Descent (v2).pdf
4. quora-Why does mean normalization help in gradient descent?
5. scikit learn-Importance of Feature Scaling
6. scikit learn-5.3. Preprocessing data
7. scikit learn-Compare the effect of different scalers on data with outliers
8. data school-Comparing supervised learning algorithms
9. Lecun paper-Efficient BackProp
10. Hinton video-3.2 The error surface for a linear neuron
11. CS231n-Neural Networks Part 2: Setting up the Data and the Loss
12. ftp-Should I normalize/standardize/rescale the data?
13. medium-Understand Data Normalization in Machine Learning
14. Normalization and Standardization
15. How and why do normalization and feature scaling work?
16. Is it a good practice to always scale/normalize data for machine learning?
17. When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

