Images, speech, and text can all be processed, and the results are strong in every case. In CV, it even surpasses models including MAE and MaskFeat.

2025/06/26 14:35:36

Posted from Aofei Temple

qubit | Official account QbitAI

Meta AI has developed a unified self-supervised learning model: Data2vec.

How does it unify modalities? Images, speech, and text can all be processed, and the results are strong in every case. In CV, it even surpasses models including MAE and MaskFeat.

How does it do this? Let's take a look at the ideas and structure of Data2vec.

How Data2vec unifies images, speech, and text

Regarding this question, we can see some clues in the model's name.

Just as Word2vec converts words into computable vectors, Data2vec converts different types of data into data sequences of the same form, which neatly sidesteps the problem of modal differences.

Then, self-supervised learning is used to mask part of these data sequences, and the model is trained to restore the masked part.

The structure of Data2vec is designed around exactly this idea.

Data2vec uses a teacher-student network structure based on the Transformer architecture:

[Figure: Data2vec's teacher-student architecture]

As the figure above shows, input of any form is first converted into a data sequence, and part of the information is masked (blocking the dog's head in an image, covering a segment of speech, or hiding a word).

The student network then predicts a representation of the complete input from the partially visible input, with targets provided by the teacher network, so that a single model can handle multiple tasks.
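The teacher-student coupling can be sketched as follows. This is a minimal toy illustration, not Meta's implementation: the teacher's weights track the student's via an exponential moving average (EMA), a scheme Data2vec shares with other self-distillation methods. The `tau` value and the toy one-matrix "network" are assumptions for illustration.

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Move teacher weights toward student weights via an
    exponential moving average (EMA)."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one weight matrix per network.
student = [np.ones((2, 2))]   # pretend the student has been trained
teacher = [np.zeros((2, 2))]  # teacher starts elsewhere

for _ in range(1000):         # each training step updates the EMA
    teacher = ema_update(teacher, student, tau=0.99)

# After many steps the teacher closely tracks the student,
# providing stable prediction targets for the masked positions.
print(float(teacher[0][0, 0]))
```

Because the teacher lags the student smoothly, its representations of the full input make stable targets for the student's predictions on the masked input.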

The next question is how to convert different types of input into the same form.

How to standardize input data

When standardizing inputs, Data2vec still treats each modality on a case-by-case basis.

After all, pixels, waveforms, and text are completely different forms. Data2vec therefore adopts a different encoding strategy for each form of input, but the goal is the same: to convert all of these inputs into data sequences.

The specific operations are as follows:

| Task | Encoding method | Masking method |
| --- | --- | --- |
| Computer vision | ViT image patching | Block-wise masking strategy |
| Speech | Multi-layer one-dimensional convolutional neural network | Masking spans of latent speech representations |
| Text | Pre-processing into sub-word units, which are then embedded into a distributional space via embedding vectors | Tokens |

ViT's encoding strategy divides an image into a series of patches, each 16x16 pixels, which are then fed into a linear transformation.
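For a concrete sense of the patching step, here is a minimal numpy sketch (not ViT's actual code): a 224x224 RGB image becomes 14x14 = 196 patches of 16x16x3 = 768 values each, which the linear layer would then project into the model dimension.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping
    patch vectors, as in ViT's input pipeline."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

img = np.random.rand(224, 224, 3)
patches = patchify(img)
print(patches.shape)  # (196, 768): 14x14 patches, 16*16*3 values each
```

Each row of the result is one patch; the first row, for instance, is the top-left 16x16 block of the image flattened into a vector.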

Speech is encoded by converting a 16kHz waveform into a 50Hz data sequence using a multi-layer one-dimensional convolutional neural network.
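The 16kHz-to-50Hz downsampling comes from the strides of the convolution stack. Assuming wav2vec 2.0-style kernel sizes and strides (a plausible configuration for illustration; the exact encoder may differ), the total stride is 320, so one second of audio (16,000 samples) yields roughly 50 frames:

```python
def conv_out_len(n, kernel, stride):
    """Output length of a 1D convolution with no padding."""
    return (n - kernel) // stride + 1

# Kernel/stride pairs in the style of wav2vec 2.0's feature encoder
# (an assumption for illustration). Total stride: 5*2*2*2*2*2*2 = 320.
layers = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

n = 16_000  # one second of 16kHz audio
for kernel, stride in layers:
    n = conv_out_len(n, kernel, stride)

print(n)  # ~50 frames per second, i.e. roughly a 50Hz sequence
```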

[Figure]

Together with the embedding vectors from text encoding, all modal inputs are thus converted into data sequences, which facilitates subsequent training.

As for masking strategies, the modalities also differ.

For example, a block of an image can simply be covered, but speech and text carry contextual dependencies and cannot be masked arbitrarily.
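The idea behind masking speech and text can be sketched as span masking: instead of dropping independent random positions, contiguous spans are hidden so that the model cannot trivially fill a gap from its immediate neighbours. A minimal sketch (the actual strategies in the paper are more involved):

```python
import random

def mask_spans(seq_len, n_spans, span_len, seed=0):
    """Mask contiguous spans of a sequence, returning the set of
    masked indices. Contiguity matters for speech and text, where
    neighbouring positions are highly correlated."""
    rng = random.Random(seed)
    masked = set()
    for _ in range(n_spans):
        start = rng.randrange(0, seq_len - span_len + 1)
        masked.update(range(start, start + span_len))
    return masked

masked = mask_spans(seq_len=100, n_spans=3, span_len=10)
print(sorted(masked)[:5])  # indices of the first masked span
```

Overlapping spans simply merge, so the total number of masked positions is at most `n_spans * span_len`.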

Therefore, for each modality, Data2vec adopts a masking method that matches that data's characteristics.

After standardizing the inputs, Data2vec is also fine-tuned for different downstream tasks. The speech and text models have been released on GitHub, and the vision model is on the way:

[Screenshot: the released Data2vec models on GitHub]

Let's take a look at how this unified model performs.

Performance

Although Data2vec takes on several modalities at once, its performance does not suffer. In computer vision, the pre-training results on ImageNet-1K (IN1K) are shown in the following table:

[Table: pre-training results on IN1K]

Data2vec achieves the best accuracy among the compared models. Moreover, Data2vec was trained for only 800 epochs, while MaskFeat in the table was trained for 1600 epochs.

The gap looks even more obvious in the bar chart; the blue bars are Data2vec:

[Figure: bar chart of accuracy comparison]

In terms of speech processing, the pre-training results on LS-960 (LibriSpeech, 960 hours) are as follows:

[Table: pre-training results on LS-960]

As can be seen, Data2vec's word error rate is lower than that of wav2vec 2.0 and HuBERT across different amounts of labeled data.
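Word error rate (WER), the metric compared here, is the word-level edit distance between the hypothesis and the reference transcript divided by the number of reference words. A minimal implementation for reference:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

A lower WER therefore means fewer word-level mistakes per reference word.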

[Figure]

On the GLUE benchmark, Data2vec is comparable to RoBERTa on metrics including natural language inference (MNLI, QNLI, RTE), sentence similarity (MRPC, QQP, STS-B), grammaticality (CoLA), and sentiment analysis (SST-2).

The baseline is RoBERTa trained in a setup similar to BERT's:

[Table: GLUE results]

The overall scores are also similar:

[Figure: overall GLUE score comparison]

It seems that a unified model architecture really can work effectively across multiple modalities.

Although Data2vec still handles input encoding and masking differently for each modality, it is a genuine attempt at model unification.

In the future, there may be a unified masking strategy and mixed datasets spanning modalities, achieving true unity.

Reference link:

[1]https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language
[2]https://ai.facebook.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text
[3]https://github.com/pytorch/fairseq/tree/main/examples/data2vec
