In recent years, self-supervised learning methods have made great progress, even achieving results that surpass supervised learning in transfer generalization.


Self-supervised learning methods have made great progress in recent years, even achieving results that surpass supervised learning in transfer generalization. In this work, researchers from the basic vision intelligence team of Alibaba DAMO Academy and Tsinghua University rethink the shortcomings of supervised pre-training and propose a supervised learning algorithm based on leave-one-out k-nearest-neighbor prediction (Leave-One-Out K-Nearest-Neighbors, LOOK), which surpasses existing supervised and unsupervised methods on multiple downstream tasks. The work has been published at ICLR 2022.


Paper link:

https://arxiv.org/pdf/2110.06014.pdf

1. Background

In representation learning, the basic "pre-training then fine-tuning" paradigm has been widely adopted. In this paradigm, a model is first pre-trained on a large-scale general upstream dataset, and the trained model is then fine-tuned on a specific downstream dataset for the target application. For pre-training, supervised training based on cross entropy (Cross Entropy, C.E.) is the most common choice: the model is trained with sample labels so as to learn feature representations tied to high-level semantic labels.

In recent years, unsupervised representation learning, which does not rely on sample labels, has made great progress. In particular, methods based on contrastive learning have reached performance comparable to supervised methods, and have even surpassed them on downstream tasks such as object recognition, semantic segmentation, and fine-grained classification. Representative unsupervised methods capture the information in the data by pulling together different augmented views of the same sample and pushing apart different samples. However, lacking the assistance of labels that are closer to human cognition, this type of method is weaker at extracting high-level semantic information.

Against this background, we revisit existing supervised pre-training for representation learning and find that its limited transfer performance stems from ignoring intra-class semantic differences. Figure 1 shows two common supervised learning methods, cross entropy (C.E.) and supervised contrastive learning (Supervised Contrastive Learning, SupCon), with arrows indicating the optimization direction of sample features during training. To distinguish different categories, both methods pull the feature distributions of same-class samples closer during training, but in slightly different ways: C.E. constructs a parameterized center for each category, while SupCon pulls samples together directly, point to point. The examples in the figure show that even samples of the same class exhibit diverse distributions, i.e., there are many sample pairs that share a label yet differ greatly in content. Forcing such pairs together damages the model's ability to extract the natural information in the images, causing it to discard the semantic features that distinguish these samples and thus hurting transfer to downstream datasets. This phenomenon can also be described as overfitting to the upstream dataset.


Figure 1: Comparison between existing supervised learning methods and the method in this article

2. Ideas

To address this upstream overfitting problem in supervised pre-training, this paper proposes supervised training based on the leave-one-out k-nearest-neighbor method (Leave-One-Out k-Nearest-Neighbors, LOOK), which only pulls together same-class samples that are already highly similar, avoiding the loss of transfer ability caused by forcing together highly dissimilar samples within a class. The left side of Figure 2 shows the distribution of pre-trained features obtained with the cross-entropy loss: under supervision that uniformly pulls same-class samples together, each class forms a clear single-cluster distribution. Unlike a linear classifier, the k-nearest-neighbor classifier used by the proposed LOOK method does not require all samples of the same category to fall into a single cluster: given a query sample, as long as most of the labels among its nearest neighbors agree with its own label, the classification is completed correctly.

Therefore, under this optimization objective it is sufficient that same-class samples form the majority within the k-nearest-neighbor range of each training sample, so a category can take on a multi-cluster distribution. The right side of Figure 2 shows the feature distribution obtained by training in this way, and the multi-cluster structure formed by the proposed method can be clearly observed. Figure 2 further shows samples selected from these clusters: even on ImageNet, whose category definitions are relatively complete, intra-class differences still exist. As shown, the football-helmet class in fact splits into two subclasses, a single helmet object and a helmet worn in game photos, and the harmonica class likewise splits into a single harmonica object and a harmonica being played. The proposed method distinguishes these subclasses well, indicating that it retains the valuable semantic information needed to separate them, which further improves downstream transfer generalization.


Figure 2: Visual comparison of features and samples between the proposed LOOK method and the cross-entropy method (C.E.)

3. Method

3.1 LOOK: k-nearest neighbor supervised learning based on the leave-one-out method

Consider pre-training on a large-scale upstream dataset. Let the upstream dataset be D = {x_i}, i = 1, ..., N, containing N samples to be learned, with a corresponding label set {y_i}, y_i ∈ {1, ..., C}, giving the category of each sample; the model to be trained can be expressed as a mapping function f that maps a sample x_i to a high-dimensional representation z_i = f(x_i).

For a training sample x_i with representation z_i, let N_k(i) denote the set of its k nearest neighbors in the dataset (the sample itself is left out, hence "leave-one-out"); the category of the current sample is then predicted from these neighbors:

ŷ_i = Σ_{j ∈ N_k(i)} w_ij · onehot(y_j)

where w_ij is the aggregation weight, given by the cosine similarity between z_i and z_j, and onehot(y_j) is a one-hot vector whose dimension equals the number of categories (position y_j takes the value 1 and all remaining positions take 0). On this basis, a Softmax function with temperature is used to normalize the aggregated label scores so that they sum to 1, and the negative logarithm can then be used to construct the loss function:

L_i = − Σ_{c=1}^{C} 1[c = y_i] · log( exp(ŷ_{i,c} / τ) / Σ_{c'=1}^{C} exp(ŷ_{i,c'} / τ) )

where τ is the temperature hyperparameter of the Softmax function, controlling the sharpness of the normalization, and 1[c = y_i] is an indicator that takes the value 1 if and only if c = y_i, and 0 otherwise.

With the above loss function, the model pulls the features of connected same-class samples closer during training while pushing apart the features of samples from different classes. Note, however, that because a neural network is trained with iterative parameter updates, the neighbor graph must be continuously updated based on the current parameters, and the distance computation and sorting involved incur a large computational cost; on a large-scale upstream dataset this would seriously lengthen pre-training. To address this efficiency issue, the following section designs efficient computation and optimization strategies that scale the proposed method to large-scale datasets.
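To make the loss concrete, below is a minimal sketch, in PyTorch, of how the leave-one-out k-NN loss could be computed for a batch of query features against a feature bank. It is an illustration under stated assumptions (L2-normalized features, a bank that excludes the query samples themselves, and hypothetical names such as look_loss), not the authors' released implementation.

```python
import torch.nn.functional as F

def look_loss(feats, labels, bank_feats, bank_labels, num_classes, k=20, tau=0.1):
    """feats: (B, D) query features; bank_feats/bank_labels: the (N, D) search space.

    Assumes the bank does not contain the query samples themselves (leave-one-out).
    """
    feats = F.normalize(feats, dim=1)
    bank_feats = F.normalize(bank_feats, dim=1)

    # Cosine similarity between each query and every bank sample: (B, N).
    sim = feats @ bank_feats.t()

    # Keep the k most similar bank samples for each query.
    weights, idx = sim.topk(k, dim=1)                           # both (B, k)

    # Aggregate the neighbors' one-hot labels with the cosine weights.
    onehot = F.one_hot(bank_labels[idx], num_classes).float()   # (B, k, C)
    agg = (weights.unsqueeze(-1) * onehot).sum(dim=1)           # (B, C)

    # Temperature softmax over classes, then negative log-likelihood of y_i.
    logp = F.log_softmax(agg / tau, dim=1)
    return F.nll_loss(logp, labels)
```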

3.2 Adapting LOOK to large-scale datasets

On large-scale datasets, the proposed LOOK method faces two main computational problems:

  • On the one hand, under the online update mode of training, the cost of re-extracting features for all dataset samples after every parameter update is unbearable, so when computing distances between samples one must deal with the mismatch between the model used for feature extraction and the current, most recent model;

  • On the other hand, because the dataset is large, directly computing the k nearest neighbors of the current sample over the entire dataset is also extremely expensive, so the key question is whether this computation can be approximated using a smaller subset.

This section addresses the above problems and achieves efficient LOOK training on large-scale datasets from the following perspectives.

(1) Search space construction

Since nearest-neighbor search over the entire dataset is very time-consuming, this paper explores constructing a smaller search subspace. The sub-search space should meet two conditions:

  • the search space should be as large as possible, to cover the complete dataset as well as possible;

  • the sample features in the search space should be temporally synchronized, to ensure that the distance measurement between samples is reasonable.

To meet these requirements, this paper introduces the momentum queue mechanism proposed in Momentum Contrast (MoCo): during training, a first-in-first-out sample queue is dynamically maintained from each batch of training samples, retaining the features of the most recently seen samples. To keep the sample features in the queue temporally synchronized, the features are not produced by the model being updated in real time; instead, an additional momentum model is maintained whose parameters move significantly more slowly than the real-time model, so the features in the queue remain approximately synchronized and provide a larger, approximately synchronized search space.
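Below is a minimal sketch of how such a momentum model and first-in-first-out feature queue could be maintained. The class name MomentumQueue, the queue size, and the momentum coefficient m are illustrative assumptions, not the paper's exact configuration.

```python
import copy
import torch

class MomentumQueue:
    """FIFO feature queue filled by a slowly updated momentum copy of the encoder."""

    def __init__(self, encoder, feat_dim, queue_size=65536, m=0.999):
        self.encoder = encoder                              # real-time (online) model
        self.momentum_encoder = copy.deepcopy(encoder)      # slow-moving copy
        for p in self.momentum_encoder.parameters():
            p.requires_grad = False
        self.m = m
        self.feats = torch.zeros(queue_size, feat_dim)      # queued features
        self.labels = torch.zeros(queue_size, dtype=torch.long)
        self.ptr, self.size = 0, queue_size

    @torch.no_grad()
    def update(self, images, labels):
        # EMA update: the momentum model moves much more slowly than the online
        # model, so the features already in the queue stay roughly synchronized.
        for p_q, p_k in zip(self.encoder.parameters(),
                            self.momentum_encoder.parameters()):
            p_k.data.mul_(self.m).add_(p_q.data, alpha=1 - self.m)

        # Enqueue the new batch's momentum features, overwriting the oldest entries.
        feats = self.momentum_encoder(images)
        idx = torch.arange(self.ptr, self.ptr + feats.shape[0]) % self.size
        self.feats[idx] = feats
        self.labels[idx] = labels
        self.ptr = (self.ptr + feats.shape[0]) % self.size
```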

(2) Predictor-based fast convergence optimization

When the momentum-queue search space is used, convergence becomes too slow. This is because the proposed algorithm pulls together the features of neighboring samples, and the resulting pull between the real-time model and the momentum model greatly slows down the updates of the real-time model. To solve this problem, this paper adds a predictor, a small multi-layer perceptron (MLP), after the real-time model to provide a buffer between the two models, avoiding the slow convergence caused by pulling features directly toward the momentum model.
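A minimal sketch of such a predictor head is shown below; the hidden width and the BatchNorm/ReLU layout are assumptions for illustration, not necessarily the configuration used in the paper.

```python
import torch.nn as nn

def build_predictor(feat_dim=256, hidden_dim=2048):
    # Maps online-model features into the space where they are compared against
    # momentum-queue features, buffering the direct pull toward the slow model.
    return nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, feat_dim),
    )
```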

(3) Dynamic adjustment of nearest neighbor hyperparameters

In the proposed method, the range of the neighbor graph and the aggregation temperature have a large influence on training, and the appropriate values differ across training stages. In the early stage of training, sample points of each category are scattered almost randomly; if the nearest-neighbor range is too small, there may be no same-class samples within it, leaving only the push-apart effect and slowing convergence. In the middle and late stages, same-class samples have already begun to aggregate; the neighbor range should then be reduced, to avoid simultaneously pulling together large numbers of same-class but dissimilar samples, so that the intra-class multi-cluster distribution described in the motivation can form. Based on this analysis, this paper applies a dynamic decay strategy to the nearest-neighbor hyperparameters so that the needs of different training stages are met.
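As one concrete possibility, the sketch below decays the neighborhood size k from a larger to a smaller value over training with a cosine schedule; the schedule form and the values k_max/k_min are assumptions for illustration, not the paper's exact strategy.

```python
import math

def current_k(epoch, total_epochs, k_max=40, k_min=10):
    # Larger k early (so same-class samples exist among the neighbors and the
    # pull-together term is active), smaller k later (so the intra-class
    # multi-cluster structure is preserved).
    frac = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return int(round(k_min + (k_max - k_min) * frac))
```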

4. Experiments

4.1 Transfer performance comparison


Table 1: Linear transfer results on multiple downstream datasets


Table 2: Full training results on multiple downstream datasets

The above results show that the proposed LOOK method achieves transfer results superior to existing supervised and unsupervised methods on each downstream dataset.


Table 3: Experimental results with different downstream transfer algorithms

The above results show that the proposed method maintains a stable performance improvement even when more complex and advanced downstream transfer algorithms are used.

4.2 Ablation experiments


Table 4: Ablation experiments on queue length, the momentum hyperparameter, and the k-nearest-neighbor range

The above results are averages of the linear transfer results over 9 downstream datasets. They show that the proposed method is robust to hyperparameter settings and achieves its best performance with an appropriate k-nearest-neighbor hyperparameter.

4.3 Training-free transfer results

In addition to the conventional transfer protocols, this paper also explores training-free transfer: only the sample feature pool is updated, and a k-nearest-neighbor classifier is used directly for prediction on the downstream task. The experimental results demonstrate the superiority of the proposed method under this protocol as well, and this part of the experiments can also serve as a reference for subsequent related work.
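For reference, the sketch below illustrates this training-free protocol: the frozen pre-trained encoder produces a feature bank from the downstream training set, and test samples are classified by temperature-weighted k-NN voting. Function names and hyperparameters here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_predict(test_feats, bank_feats, bank_labels, num_classes, k=20, tau=0.07):
    """Classify test features by weighted voting over their k nearest bank samples."""
    test_feats = F.normalize(test_feats, dim=1)
    bank_feats = F.normalize(bank_feats, dim=1)
    sim = test_feats @ bank_feats.t()                           # (T, N) cosine similarities
    weights, idx = sim.topk(k, dim=1)                           # (T, k)
    weights = (weights / tau).exp()                             # sharpen the neighbor weights
    onehot = F.one_hot(bank_labels[idx], num_classes).float()   # (T, k, C)
    scores = (weights.unsqueeze(-1) * onehot).sum(dim=1)        # (T, C)
    return scores.argmax(dim=1)
```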


Table 5: Training-free transfer results

4.4 Feature visualization analysis


Figure 3: Feature visualization comparison

The t-SNE visualizations above show that, compared with existing methods, the proposed method produces a clearly multi-cluster and looser feature distribution, consistent with the motivation.

5. Conclusion

This paper rethinks existing supervised learning algorithms and proposes the leave-one-out k-nearest-neighbor pre-training method (LOOK) to address the reduced generalization caused by overfitting to upstream data and neglecting intra-class differences, and further optimizes it for learning efficiency on large-scale datasets. Experimental results show that LOOK achieves significant improvements over existing methods on downstream transfer tasks, and the learned representations form multi-cluster distribution patterns reflecting intra-class differences, improving the generalization and transfer ability of the model.

Authors: Feng Yutong, Jiang Jianwen

Illustration by Viktoriya Belinio from icons8

-The End-


"AI Technology Flow" original submission plan

TechBeat is an AI learning community established by Jiangmen Venture Capital (www.techbeat.net). The

community has launched 330+ talk videos and 900+ technical articles, covering CV/NLP/ML/Robotis, etc.; it holds regular conferences and other online communication activities every month, and occasionally holds offline gatherings and communication activities for technical people. . We are working hard to become a high-quality, knowledge-based exchange platform favored by AI talents. We hope to create more professional services and experiences for AI talents, accelerate and accompany their growth.

Submission content

// Latest technology interpretation/systematic knowledge sharing //

// Cutting edge information explanation/experience experience //

Instructions for submission

Manuscripts must be original articles and indicate the author's information.

We will select some articles that are in the direction of in-depth technical analysis and scientific research experience to inspire users and create original content rewards.

Submission method

Send an email to

[email protected]

or add staff WeChat (chemn493) to submit , to communicate the submission details; you can also follow the "Jiangmen Venture Capital" public account and reply " submission " in the background to get submission instructions.

Please add staff WeChat when submitting!

About me " gate "

Jiangmen is a new venture capital institution that focuses on discovery, acceleration and investment in technology-driven startups . It covers Jiangmen innovative services and Jiangmen Technology Society. Group and Jiangmen Venture Capital Fund .

Jiangmen was founded at the end of 2015. The founding team was built by the original founding team of Microsoft Ventures in China. It has selected and deeply incubated 126 innovative technology startups for , Microsoft, and .

If you are a start-up in the technology field, you not only want to obtain investment, but also want to obtain a series of continuous and valuable post-investment services.

welcome to send or recommend projects to my "door":
