Posted from Aofei Temple
QbitAI | Official account QbitAI
Meta AI has developed a unified self-supervised learning model, Data2vec. It can process images, speech, and text, with strong results across all three; in computer vision it even surpasses many models, including MAE and MaskFeat.
How does it do this? Let's take a look at the ideas behind Data2vec and its structure.
How Data2vec unifies images, speech, and text
Regarding this question, we can glean some clues from the model's name.
Just as Word2vec converts words into computable vectors, Data2vec converts different types of data into data sequences of the same form, neatly sidestepping the problem of modal differences.
Then, using self-supervised learning, part of the data is masked, and the model is trained to restore the masked part.
Its structure is also designed around this idea.
Data2vec designed a teacher-student network structure based on the Transformer architecture:
As the figure above shows, regardless of the input's form, it is first converted into a data sequence, and part of the information is masked (blocking out the dog's head, covering a segment of speech, or hiding a word).
The student network then predicts the complete input from the partially visible input, with the teacher network providing the targets, so that one model can handle multiple tasks.
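The teacher-student mechanics described above can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the real model is a Transformer, while here a single linear layer stands in for the encoder, and all names and shapes are hypothetical. The key ideas shown are that the student sees a masked input, regresses onto the teacher's representations of the full input, and the teacher tracks the student via an exponential moving average (EMA).

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 8
student_w = rng.normal(size=(dim, dim))
teacher_w = student_w.copy()   # teacher starts as a copy of the student
tau = 0.999                    # EMA decay rate

def encode(w, x):
    """Stand-in for the Transformer encoder: a single linear map."""
    return x @ w

def train_step(x, mask, student_w, teacher_w, lr=0.01):
    # Teacher sees the full input and produces target representations.
    targets = encode(teacher_w, x)
    # Student sees the masked input (masked positions zeroed out here).
    student_out = encode(student_w, x * mask[:, None])
    # Regress the student's outputs at masked positions onto the targets.
    diff = (student_out - targets) * (1 - mask)[:, None]
    loss = (diff ** 2).mean()
    # Exact gradient step for this toy linear student.
    grad = (x * mask[:, None]).T @ diff * 2 / diff.size
    student_w = student_w - lr * grad
    # Teacher weights slowly track the student via an EMA.
    teacher_w = tau * teacher_w + (1 - tau) * student_w
    return loss, student_w, teacher_w

x = rng.normal(size=(16, dim))             # 16 "time steps" of a data sequence
mask = (np.arange(16) % 2).astype(float)   # 1 = visible, 0 = masked
loss, student_w, teacher_w = train_step(x, mask, student_w, teacher_w)
```

In the real model the student regresses onto averaged hidden-layer representations of the teacher rather than raw outputs, but the masked-prediction plus EMA-teacher loop is the same.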
The next question is how to convert different types of input into the same form.
How Data2vec standardizes input data
When standardizing inputs, Data2vec treats each modality on its own terms.
After all, pixels, waveforms, and text are completely different forms, so Data2vec adopts a different encoding strategy for each, but the goal is the same: convert all of these inputs into data sequences.
The specific methods are as follows:
| Task | Encoding method | Masking method |
| --- | --- | --- |
| Computer vision | ViT image patching | Block-wise masking strategy |
| Speech | Multi-layer one-dimensional convolutional neural network | Masking spans of latent speech representations |
| Text | Tokenize into sub-word units, then map them into the distribution space via embedding vectors | Token masking |
ViT's encoding strategy divides an image into a series of patches of 16x16 pixels each, which are then fed into a linear projection.
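The patching step can be sketched with numpy. This is a minimal illustration assuming a 224x224 RGB image (a common ViT input size; the image size and embedding dimension here are illustrative): the image is cut into a grid of 16x16 patches, each patch is flattened, and a shared linear projection turns the patches into a token sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))   # H x W x C, a stand-in image

patch = 16
h, w, c = img.shape
# Cut the image into a 14x14 grid of 16x16 patches and flatten each one.
patches = (img.reshape(h // patch, patch, w // patch, patch, c)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, patch * patch * c))   # (196, 768)

embed_dim = 768
proj = rng.normal(size=(patch * patch * c, embed_dim))
tokens = patches @ proj           # the "data sequence": 196 tokens
```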
The speech encoding uses a multi-layer one-dimensional convolutional neural network to convert a 16kHz waveform into a 50Hz data sequence.
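The 16kHz-to-50Hz downsampling falls out of the total stride of the convolution stack. The strides below are those of the wav2vec 2.0 feature encoder, which Data2vec's speech encoder is based on; the arithmetic shows why one second of audio becomes 50 latent vectors.

```python
# Strides of the seven 1-D convolution layers in the feature encoder.
strides = [5, 2, 2, 2, 2, 2, 2]

total_stride = 1
for s in strides:
    total_stride *= s                 # 5 * 2**6 = 320 samples per frame

sample_rate = 16_000                  # Hz
frame_rate = sample_rate / total_stride   # frames of latent speech per second

one_second = sample_rate              # samples in one second of audio
frames = one_second // total_stride   # latent vectors per second of audio
```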
Together with the text embedding vectors, inputs from all modalities are thus converted into data sequences, which simplifies subsequent training.
The masking strategies also differ by modality.
For example, part of an image can simply be covered, but speech and text carry contextual relationships, so arbitrary pieces cannot be masked casually.
For each modality, Data2vec therefore adopts a masking method that matches the characteristics of its data.
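The two masking styles can be sketched as follows. This is a simplified illustration of the strategies named above, not the paper's exact procedure: the grid size, block size, and span parameters are illustrative. Block-wise masking hides a contiguous square of image patches, while span masking hides contiguous runs of time steps in a sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_mask(grid=14, block=4):
    """Image-style block-wise masking: hide one square block of patches."""
    mask = np.ones((grid, grid), dtype=int)   # 1 = visible, 0 = masked
    r = rng.integers(0, grid - block + 1)
    c = rng.integers(0, grid - block + 1)
    mask[r:r + block, c:c + block] = 0        # the masked block
    return mask.ravel()

def span_mask(length=100, span=10, n_spans=3):
    """Speech/text-style span masking: hide contiguous runs of steps."""
    mask = np.ones(length, dtype=int)
    for start in rng.integers(0, length - span, size=n_spans):
        mask[start:start + span] = 0
    return mask

m1 = block_mask()   # mask over a 14x14 grid of image patches
m2 = span_mask()    # mask over a 100-step latent sequence
```

Masking whole spans rather than isolated positions matters for speech and text, since an isolated masked step is often trivially predictable from its immediate neighbors.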
After standardization, Data2vec is also fine-tuned for different downstream tasks. The speech and text models have been released on GitHub, and the vision model is on the way:
Let's take a look at how the performance of this unified model is.
Performance
Although Data2vec handles multiple modalities at once, its performance has not suffered. In computer vision, the pre-training results on IN1K are shown in the following table:
Data2vec achieves the best accuracy compared with the other models. Moreover, Data2vec was trained for only 800 epochs, while MaskFeat in the table was trained for 1600 epochs.
The gap is even more obvious in the bar chart; the blue bars are Data2vec:
In terms of speech processing, the pre-training results on LS-960 are as follows:
It can be seen that Data2vec's word error rate is lower than that of wav2vec 2.0 and HuBERT across different amounts of labeled data.
In the GLUE evaluation, Data2vec is comparable to RoBERTa on tasks such as natural language inference (MNLI, QNLI, RTE), sentence similarity (MRPC, QQP, STS-B), grammaticality (CoLA), and sentiment analysis (SST).
The baseline is RoBERTa trained in a setting similar to BERT's:
The overall scores are similar:
It seems that a unified model architecture really can be used effectively across multiple task modalities.
Although Data2vec still handles input encoding and masking differently per modality, it is a genuine attempt at model unification.
In the future there may be a unified masking strategy and mixed datasets of different modalities, achieving true unity.
Reference links:
[1]https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language
[2]https://ai.facebook.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text
[3]https://github.com/pytorch/fairseq/tree/main/examples/data2vec