
This article explores Standardization, the most commonly used form of feature scaling.

Preface

Feature scaling, commonly referred to as "feature normalization" or "standardization", is an important technique in data preprocessing. Sometimes it even determines whether an algorithm works at all and how well it works. When it comes to the necessity of feature scaling, the two most commonly cited examples are probably:

  • The units (scales) of different features may differ: height versus weight, Celsius versus Fahrenheit, house area versus number of rooms. One feature may range over [1000, 10000] while another ranges over [−0.1, 0.2]. In distance-related calculations, such unit differences change the result: large-scale features become decisive while small-scale features may be ignored. To eliminate the impact of unit and scale differences and treat every feature dimension equally, the features need to be normalized.
  • With the original features, because of the scale differences, the contour plot of the loss function may be elliptical. The gradient is perpendicular to the contour lines, so gradient descent follows a zigzag route instead of pointing toward the local minimum. After a zero-mean, unit-variance transformation of the features, the contour plot becomes closer to circular, the gradient descent direction oscillates less, and convergence is faster, as shown in the figure below (picture from Andrew Ng).


Feature Scaling from Andrew Ng

Standardization, the most commonly used form of feature scaling, looks like a "no-brainer". This article explores the whys behind it:

  • What are the commonly used feature scaling methods?
  • Which feature scaling method should be used in which situation? Is there a guiding principle?
  • Do all machine learning algorithms require feature scaling? Are there exceptions?
  • Are the contour plots of loss functions always ellipses or concentric circles? Can the effect of feature scaling be explained simply with ellipses and circles?
  • If the contour plot of the loss function is complicated, is there another intuitive explanation of feature scaling?

Based on the information gathered, this article tries to answer the questions above. The author's ability is limited, however, so let's just go as far as we can (smile).

Commonly used feature scaling methods

Before asking why, first look at what it is.

Given a data set, let the feature vectors be x with dimension D and the number of samples be R. They can be arranged into a D×R matrix, with one column per sample and one row per feature dimension, as shown in the figure below (picture from Hung-yi Lee pdf-Gradient Descent):


feature matrix

Feature scaling methods can be divided into two categories: row-wise and column-wise. Row-wise operations act on each feature dimension, while column-wise operations act on each sample. The figure above is an example of row-wise feature standardization.

Specifically, the commonly used feature scaling methods are as follows (from wiki-Feature scaling):

  • Rescaling (min-max normalization, range scaling):

x' = a + (x − min(x)) (b − a) / (max(x) − min(x))

This linearly maps each feature dimension to a target range [a, b]: the minimum value maps to a and the maximum value maps to b. Common target ranges are [0, 1] and [−1, 1]. In particular, the formula for mapping to [0, 1] is:

x' = (x − min(x)) / (max(x) − min(x))

  • Mean normalization:

x' = (x − mean(x)) / (max(x) − min(x))

This maps the mean to 0 and normalizes the feature by the difference between the maximum and minimum values. A more common choice is to normalize by the standard deviation, as follows.

  • Standardization (Z-score Normalization):

x' = (x − mean(x)) / σ(x)

This gives each feature dimension zero mean and unit variance (zero-mean and unit-variance).

  • Scaling to unit length:

x' = x / ||x||

Each sample's feature vector is divided by its length, i.e., the sample feature vector is normalized to unit length. The length is usually measured by the L2 norm (Euclidean distance) and sometimes by the L1 norm; a comparison of different norms can be found in the paper "CVPR2005-Histograms of Oriented Gradients for Human Detection".

Of the four feature scaling methods above, the first three are row-wise operations and the last one is a column-wise operation.
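As a concrete reference, here is a minimal NumPy sketch of the four methods (my own illustration; the feature matrix is assumed to be laid out as above, one row per feature dimension and one column per sample):

```python
import numpy as np

# Feature matrix X: D rows (feature dimensions) x R columns (samples).
X = np.array([[1000., 2000., 3000., 10000.],   # a large-scale feature
              [-0.1,   0.0,   0.1,   0.2]])    # a small-scale feature

x_min = X.min(axis=1, keepdims=True)
x_max = X.max(axis=1, keepdims=True)
x_mean = X.mean(axis=1, keepdims=True)

# 1. Rescaling (min-max normalization) to [0, 1], row-wise (per feature).
X_minmax = (X - x_min) / (x_max - x_min)

# 2. Mean normalization: subtract the mean, divide by the max-min span.
X_meannorm = (X - x_mean) / (x_max - x_min)

# 3. Standardization (z-score): zero mean and unit variance per feature.
X_standard = (X - x_mean) / X.std(axis=1, keepdims=True)

# 4. Scaling to unit length: column-wise (per sample), divide by the L2 norm.
X_unit = X / np.linalg.norm(X, axis=0, keepdims=True)

print(X_standard.mean(axis=1), X_standard.std(axis=1))  # ~[0, 0], ~[1, 1]
print(np.linalg.norm(X_unit, axis=0))                   # all ones
```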

What is confusing is the terminology. "Standardization" is fairly unambiguous, but "normalization" on its own sometimes refers to min-max normalization, sometimes to Standardization, and sometimes to scaling to unit length.

Comparative analysis of the calculation methods

The first three feature scaling methods subtract a statistic and then divide by a statistic; the last one divides by the length of the vector itself.

  • Subtracting a statistic can be seen as choosing a value as the new origin, whether the minimum or the mean, and translating the whole data set to that new origin. If differing offsets between features harm the subsequent processing, this operation helps and can be regarded as a kind of offset-independent operation; if the original feature values have a special meaning, such as sparsity, this operation may destroy that sparsity.
  • Dividing by a statistic can be regarded as scaling the feature along its coordinate axis. It removes the effect of the feature's scale and can be regarded as a kind of scale-independent operation. The scaling can use the span between the maximum and minimum values, or the standard deviation (the average distance to the center). The former is sensitive to outliers; the effect of outliers on the latter depends on their number and on the size of the data set: the fewer the outliers and the larger the data set, the smaller the impact.
  • Dividing by the length normalizes the vector length, mapping all samples onto the unit sphere, and can be regarded as a kind of length-independent operation. For example, word-frequency features need to remove the influence of article length, some image features need to remove the influence of light intensity, and unit length also makes cosine distance or inner-product similarity easier to compute.

For more on preprocessing sparse data and data with outliers, see scikit learn-5.3. Preprocessing data.
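To make the outlier point concrete, here is a small sketch on made-up data comparing how a single outlier distorts min-max scaling versus standardization:

```python
import numpy as np

# A one-dimensional feature with one large outlier.
x = np.array([1., 2., 3., 4., 5., 100.])

# Min-max scaling: the outlier maps to 1 and squeezes everything else near 0.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: the outlier still shifts the mean and inflates the std,
# but the bulk of the data keeps a usable spread.
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # roughly [0, 0.01, 0.02, 0.03, 0.04, 1]
print(x_std)     # the non-outlier points remain clearly distinguishable
```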

Let's look at the effect of these operations geometrically (picture from CS231n-Neural Networks Part 2: Setting up the Data and the Loss). Zero-mean translates the data set to the origin, and unit-variance makes the spans of the feature dimensions comparable. The figure clearly shows that the two feature dimensions are linearly correlated, and the Standardization operation does not remove this correlation.


Standardization

The linear correlation can be removed (decorrelation) with PCA, that is, by introducing a rotation to find new coordinate axes and then scaling along those new axes by the "standard deviation", as shown in the figure below (picture from the linked source). The figure also illustrates the effect of unit-length scaling: mapping all samples onto the unit sphere.


Effect of the operations of standardization and length normalization
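The decorrelation-plus-scaling step (PCA whitening) can be sketched with a plain eigendecomposition of the covariance matrix. This is a minimal illustration on made-up correlated data, not the code behind the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: N samples x 2 features.
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.8 * x1 + 0.3 * rng.normal(size=500)])

# 1. Zero-mean: translate the data set to the origin.
Xc = X - X.mean(axis=0)

# 2. Rotate onto the eigenvectors of the covariance matrix (decorrelation).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_decorr = Xc @ eigvecs

# 3. Scale each new axis by its standard deviation (whitening).
X_white = X_decorr / np.sqrt(eigvals + 1e-12)

print(np.cov(X_white, rowvar=False))  # approximately the identity matrix
```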

With more feature dimensions, the comparison is as follows (picture from YouTube):


feature scaling comparison

In general, the purpose of normalization/standardization is to obtain some kind of "independence": offset independence, scale independence, length independence, and so on. When the physical and geometric meaning behind a normalization/standardization method matches the needs of the problem at hand, it helps solve the problem; otherwise it can hurt. So "when to choose which method" depends on the problem to be solved, i.e., it is problem-dependent.

Is feature scaling needed?

The table below comes from data school-Comparing supervised learning algorithms, which compares several supervised learning algorithms; the two columns on the right indicate whether feature scaling is required.


Comparing supervised learning algorithms

Let's analyze this in detail below.

When is feature scaling needed?

  • Algorithms that involve or imply distance calculations, such as K-means, KNN, PCA, and SVM, generally require feature scaling, because:

Zero-mean generally increases the differences in cosine distance or inner product between samples, giving stronger discriminative power. Imagine a data set concentrated in the far upper-right corner of the first quadrant being translated to the origin: the differences in cosine distance between samples are clearly amplified. In template matching, zero-mean can significantly improve how well the response values discriminate.
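A quick sketch of this effect on two toy points chosen far out in the first quadrant (my own example):

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two samples sitting far up in the first quadrant.
a = np.array([100., 102.])
b = np.array([102., 100.])

print(cosine_sim(a, b))                # ~0.9998: nearly indistinguishable

# Translate the (tiny) data set to its mean, i.e. zero-mean it.
mean = (a + b) / 2
print(cosine_sim(a - mean, b - mean))  # -1.0: the difference is amplified
```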

For Euclidean distance, increasing the scale of a feature is equivalent to increasing its weight in the distance calculation. If clear prior knowledge says a certain feature is important, appropriately increasing its weight can help; but without such a prior, or when the goal is to discover which features matter, feature scaling should be done first so that every feature dimension is treated equally.
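For example (made-up height/weight values), merely changing the unit of one feature can flip which neighbor is "closest" under Euclidean distance:

```python
import numpy as np

# Features: [height in cm, weight in kg] (toy values).
query = np.array([170., 70.])
p1    = np.array([171., 90.])   # similar height, very different weight
p2    = np.array([185., 71.])   # very different height, similar weight

dist = lambda u, v: np.linalg.norm(u - v)
print(dist(query, p1), dist(query, p2))   # ~20.0 vs ~15.0: p2 looks closer

# Measure height in millimetres instead: the height feature now dominates.
to_mm = np.array([10., 1.])
print(dist(query * to_mm, p1 * to_mm),    # ~22.4
      dist(query * to_mm, p2 * to_mm))    # ~150.0: now p1 looks closer
```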

Increasing the scale of a feature also increases its variance. PCA tends to focus on the axis directions with larger variance, so other features may be ignored. Therefore, performing Standardization before PCA often works better, as shown in the picture below (picture from scikit learn-Importance of Feature Scaling).

PCA and Standardization
  • When the loss function contains a regularization term, feature scaling is generally needed. For the linear model y = wx + b, any linear transformation (translation, scaling) of x can be "absorbed" by w and b, so in theory it does not affect the model's fitting ability. But if the loss function contains a regularization term such as λ||w||^2, where λ is a hyperparameter, the same penalty is applied to every component of w. For a feature x_i with a larger scale, the corresponding coefficient w_i will be smaller, so its share of the regularization term is smaller, which amounts to a lighter penalty on w_i. In other words, the loss function relatively ignores features whose scale has been inflated. This is unreasonable, so feature scaling is needed to make the loss function treat every feature dimension equally.
  • Gradient descent requires feature scaling. The parameter update rule of gradient descent is:

W ← W − η ∇E(W)

E(W) is the loss function. The convergence speed depends on the distance from the initial parameter position to the local minimum and on the size of the learning rate η. In the one-dimensional case, near the local minimum, the effect of different learning rates on gradient descent is shown in the figure below:


Gradient descent for different learning rates

In the multi-dimensional case, the picture can be decomposed into several copies of the figure above, one per dimension, each descending separately. The parameter W is a vector, but there is only one learning rate, i.e., all parameter dimensions share the same learning rate (algorithms that assign each dimension its own learning rate are not considered here). Convergence means reaching a minimum in every parameter dimension, where every partial derivative is 0, but the rate of descent differs across dimensions. For every dimension to converge, the learning rate should be no larger than the smallest appropriate step size over all dimensions at the current position. The following discusses the effect of feature scaling on gradient descent (a numerical sketch follows after this list).

    • Zero-centering works together with parameter initialization to shorten the distance between the initial parameter position and the local minimum, speeding up convergence. The final model parameters are unknown, so they are usually initialized randomly, e.g., sampled from a zero-mean uniform or Gaussian distribution. For linear models this places the initial decision boundary roughly near the origin, and since the bias is often initialized to 0, the boundary passes directly through the origin. Meanwhile, for convergence, the learning rate cannot be very large. Feature distributions differ across data sets; if a distribution is concentrated and far from the origin, for example in the far upper-right corner of the first quadrant, the decision boundary may need many steps to "climb" to where the data set is. Therefore, whatever the data set, first translating it to the origin, in cooperation with parameter initialization, ensures that the boundary starts out passing through the data set. In addition, outliers tend to lie on the periphery of the data set; compared with moving the boundary inward from the outside, moving it outward from the central area may be less affected by outliers.
    • For a linear model with mean squared error loss (LMS), the loss function is exactly quadratic, as shown in the figure below.


    • The rate of descent differs across directions (different second derivatives, i.e., different curvatures), and this is determined precisely by the covariance matrix of the input. Scaling changes the shape of the loss function, reducing the difference in curvature between directions. Decomposing the descent by dimension: for a given step size that is not small enough, some dimensions decrease a lot, some decrease a little, and some may even increase, so the overall loss may go up or down and the process is unstable. After scaling, the curvatures in different directions become closer, an appropriate learning rate is easier to choose, and the descent is relatively more stable.
    • There is also an interpretation in terms of the eigenvalues and condition number of the Hessian matrix. For details, see the Convergence of Gradient Descent section in the Lecun paper-Efficient BackProp, which gives a clear mathematical description and also explains the role of whitening: removing the linear correlation between features so that gradient descent in each dimension can be viewed independently.
    • The elliptical and circular contour plots at the beginning of the article only apply to linear models with mean squared error loss. For other loss functions or more complex models, such as deep neural networks, the error surface may be complex and cannot be described simply by ellipses and circles, so using them to explain the effect of feature scaling on gradient descent for all loss functions is an oversimplification. See Hinton video-3.2 The error surface for a linear neuron.
    • Even when the loss function is not the mean squared error, as long as the weight w multiplies the input feature x, the partial derivative of the loss with respect to w contains a factor of x, so the descent speed of w is affected by the scale of x. In theory, an adaptive per-parameter learning rate could absorb the effect of the scale of x, but in practice, for computational reasons, all parameters often share one learning rate. In that case, different scales of x can make the descent speed vary greatly across directions, the learning rate hard to choose, and the descent process unstable. Scaling evens out the descent speed across directions, making the process relatively more stable.
  • For traditional neural networks, feature scaling of the input is also very important. With saturating activation functions such as sigmoid, if the input range is very wide and the initialization is not adapted to it, the units can fall straight into the saturation zone and the gradient vanishes. The input therefore needs to be standardized or mapped to [0, 1] or [−1, 1], and a carefully designed parameter initialization scheme is used to control the range of activations. Since the introduction of Batch Normalization, which re-normalizes the feature distribution after every linear transformation, it might seem unnecessary to scale the network's input, but it is still customary to do so.
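As promised above, here is a minimal numerical sketch of how feature scale affects gradient descent. The setup (a two-feature linear regression with mean squared error and my own toy data) is an assumption for illustration, not a figure from the cited sources:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with very different feature scales.
N = 200
X = np.column_stack([rng.uniform(0, 1, N),       # feature 1: scale ~1
                     rng.uniform(0, 1000, N)])   # feature 2: scale ~1000
y = X @ np.array([3.0, 0.002]) + 0.01 * rng.normal(size=N)

def add_bias(A):
    # Prepend a column of ones so the model has an intercept.
    return np.column_stack([np.ones(len(A)), A])

def gd_mse(A, y, lr, steps=2000):
    """Plain batch gradient descent on the mean squared error; returns final MSE."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * A.T @ (A @ w - y)
    return np.mean((A @ w - y) ** 2)

# Raw features: the learning rate must be tiny (larger values diverge),
# so descent along the small-scale direction is painfully slow.
print(gd_mse(add_bias(X), y, lr=1e-6))

# Standardized features: a much larger learning rate is stable and
# the loss drops to the noise level within the same number of steps.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(gd_mse(add_bias(Xs), y, lr=0.1))
```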

When is feature scaling not needed?

Probabilistic models that do not involve distance calculations, such as Naive Bayes, do not require feature scaling.

Tree-based models that do not involve distance calculations, such as decision trees and random forests, do not require feature scaling either. Choosing a split at a tree node only requires deciding where to divide the current feature so that the classification improves; it only cares about the relative order within a feature and is unrelated to relative magnitudes across features.
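A hedged sanity check of this point using scikit-learn (a common setup of my own, not code from any of the cited sources):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the same tree on raw and on standardized features.
raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
std = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

# Splits are thresholds within a single feature, and standardization is a
# monotonic per-feature transform, so the accuracies should come out the same.
print(raw.score(X_te, y_te))
print(std.score(scaler.transform(X_te), y_te))
```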

Summary

This article was difficult to write. At first I thought the topic was simple and straightforward, but as the exploration deepened, more and more question marks appeared, breaking many things I had taken for granted. So while writing I kept adding material, and I wanted to explain as many points as possible intuitively without copying too many formulas; my understanding is not deep enough, though, which is why the narrative has become this long. I hope future articles can be more focused and concise.

Reference

1. wiki-Feature scaling
2. wiki-Backpropagation
3. Hung-yi Lee pdf-Gradient Descent: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Gradient Descent (v2).pdf
4. quora-Why does mean normalization help in gradient descent?
5. scikit learn-Importance of Feature Scaling
6. scikit learn-5.3. Preprocessing data
7. scikit learn-Compare the effect of different scalers on data with outliers
8. data school-Comparing supervised learning algorithms
9. Lecun paper-Efficient BackProp
10. Hinton video-3.2 The error surface for a linear neuron
11. CS231n-Neural Networks Part 2: Setting up the Data and the Loss
12. ftp-Should I normalize/standardize/rescale the data?
13. medium-Understand Data Normalization in Machine Learning
14. Normalization and Standardization
15. How and why do normalization and feature scaling work?
16. Is it a good practice to always scale/normalize data for machine learning?
17. When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

