Images, speech, and text can all be processed, and the results are strong in every case. In CV, it even surpasses models including MAE and MaskFeat.

2025/06/26 14:35:36

Posted from Aofei Temple

qubit | Official account QbitAI

Meta AI has developed a unified self-supervised learning model: Data2vec.

How does it unify modalities? Images, speech, and text can all be processed, and the results are strong in every case. In CV, it even surpasses models including MAE and MaskFeat.

How does it do this? Let's take a look at the ideas and structure of Data2vec.

How Data2vec unifies images, speech, and text

Regarding this question, we can see some clues in the model's name.

Just as Word2vec converts words into computable vectors, Data2vec converts different types of data into data sequences of the same form, which neatly sidesteps the problem of modal differences.

Then, self-supervised learning is used to mask part of these data sequences, and the model is trained to restore the masked part.

The structure of Data2vec is designed around exactly this idea.

Data2vec uses a teacher-student network structure based on the Transformer architecture:

[Figure: Data2vec's teacher-student architecture]

As the figure above shows, input of any form is first converted into a data sequence, and part of the information is masked (blocking the dog's head in an image, covering a segment of speech, or hiding a word).

The student network then predicts a representation of the complete input from the partially visible input, with targets provided by the teacher network, so that a single model can handle multiple tasks.
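The teacher-student coupling can be sketched as follows. This is a minimal toy illustration, not Meta's implementation: the teacher's weights track the student's via an exponential moving average (EMA), a scheme Data2vec shares with other self-distillation methods. The `tau` value and the toy one-matrix "network" are assumptions for illustration.

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Move teacher weights toward student weights via an
    exponential moving average (EMA)."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one weight matrix per network.
student = [np.ones((2, 2))]   # pretend the student has been trained
teacher = [np.zeros((2, 2))]  # teacher starts elsewhere

for _ in range(1000):         # each training step updates the EMA
    teacher = ema_update(teacher, student, tau=0.99)

# After many steps the teacher closely tracks the student,
# providing stable prediction targets for the masked positions.
print(float(teacher[0][0, 0]))
```

Because the teacher lags the student smoothly, its representations of the full input make stable targets for the student's predictions on the masked input.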

The next question is how to convert different types of input into the same form.

How to standardize input data

When standardizing inputs, Data2vec still treats each modality on a case-by-case basis.

After all, pixels, waveforms, and text are completely different forms. Data2vec therefore adopts a different encoding strategy for each form of input, but the goal is the same: to convert all of these inputs into data sequences.

The specific operations are as follows:

| Task | Encoding method | Masking method |
| --- | --- | --- |
| Computer vision | ViT image patching | Block-wise masking strategy |
| Speech | Multi-layer one-dimensional convolutional neural network | Masking spans of latent speech representations |
| Text | Pre-processing into sub-word units, which are then embedded into a distributional space via embedding vectors | Tokens |

ViT's encoding strategy divides an image into a series of patches, each 16x16 pixels, which are then fed into a linear transformation.
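For a concrete sense of the patching step, here is a minimal numpy sketch (not ViT's actual code): a 224x224 RGB image becomes 14x14 = 196 patches of 16x16x3 = 768 values each, which the linear layer would then project into the model dimension.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping
    patch vectors, as in ViT's input pipeline."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

img = np.random.rand(224, 224, 3)
patches = patchify(img)
print(patches.shape)  # (196, 768): 14x14 patches, 16*16*3 values each
```

Each row of the result is one patch; the first row, for instance, is the top-left 16x16 block of the image flattened into a vector.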

Speech is encoded by converting a 16kHz waveform into a 50Hz data sequence using a multi-layer one-dimensional convolutional neural network.
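The 16kHz-to-50Hz downsampling comes from the strides of the convolution stack. Assuming wav2vec 2.0-style kernel sizes and strides (a plausible configuration for illustration; the exact encoder may differ), the total stride is 320, so one second of audio (16,000 samples) yields roughly 50 frames:

```python
def conv_out_len(n, kernel, stride):
    """Output length of a 1D convolution with no padding."""
    return (n - kernel) // stride + 1

# Kernel/stride pairs in the style of wav2vec 2.0's feature encoder
# (an assumption for illustration). Total stride: 5*2*2*2*2*2*2 = 320.
layers = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

n = 16_000  # one second of 16kHz audio
for kernel, stride in layers:
    n = conv_out_len(n, kernel, stride)

print(n)  # ~50 frames per second, i.e. roughly a 50Hz sequence
```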

[Figure]

Together with the embedding vectors from text encoding, all modal inputs are thus converted into data sequences, which facilitates subsequent training.

As for masking strategies, the modalities also differ.

For example, a block of an image can simply be covered, but speech and text carry contextual dependencies and cannot be masked arbitrarily.
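The idea behind masking speech and text can be sketched as span masking: instead of dropping independent random positions, contiguous spans are hidden so that the model cannot trivially fill a gap from its immediate neighbours. A minimal sketch (the actual strategies in the paper are more involved):

```python
import random

def mask_spans(seq_len, n_spans, span_len, seed=0):
    """Mask contiguous spans of a sequence, returning the set of
    masked indices. Contiguity matters for speech and text, where
    neighbouring positions are highly correlated."""
    rng = random.Random(seed)
    masked = set()
    for _ in range(n_spans):
        start = rng.randrange(0, seq_len - span_len + 1)
        masked.update(range(start, start + span_len))
    return masked

masked = mask_spans(seq_len=100, n_spans=3, span_len=10)
print(sorted(masked)[:5])  # indices of the first masked span
```

Overlapping spans simply merge, so the total number of masked positions is at most `n_spans * span_len`.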

Therefore, for each modality, Data2vec adopts a masking method that matches that data's characteristics.

After standardizing the inputs, Data2vec is also fine-tuned for different downstream tasks. The speech and text models have been released on GitHub, and the vision model is on the way:

[Screenshot: the released Data2vec models on GitHub]

Let's take a look at how this unified model performs.

Performance

Although Data2vec takes on several modalities at once, its performance does not suffer. In computer vision, the pre-training results on ImageNet-1K (IN1K) are shown in the following table:

[Table: pre-training results on IN1K]

Data2vec achieves the best accuracy among the compared models. Moreover, Data2vec was trained for only 800 epochs, while MaskFeat in the table was trained for 1600 epochs.

The gap looks even more obvious in the bar chart; the blue bars are Data2vec:

[Figure: bar chart of accuracy comparison]

In terms of speech processing, the pre-training results on LS-960 (LibriSpeech, 960 hours) are as follows:

[Table: pre-training results on LS-960]

As can be seen, Data2vec's word error rate is lower than that of wav2vec 2.0 and HuBERT across different amounts of labeled data.
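Word error rate (WER), the metric compared here, is the word-level edit distance between the hypothesis and the reference transcript divided by the number of reference words. A minimal implementation for reference:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

A lower WER therefore means fewer word-level mistakes per reference word.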

[Figure]

On the GLUE benchmark, Data2vec is comparable to RoBERTa on metrics including natural language inference (MNLI, QNLI, RTE), sentence similarity (MRPC, QQP, STS-B), grammaticality (CoLA), and sentiment analysis (SST-2).

The baseline is RoBERTa trained in a setup similar to BERT's:

[Table: GLUE results]

The overall scores are also similar:

[Figure: overall GLUE score comparison]

It seems that a unified model architecture really can work effectively across multiple modalities.

Although Data2vec still handles input encoding and masking differently for each modality, it is a genuine attempt at model unification.

In the future, there may be a unified masking strategy and mixed datasets spanning modalities, achieving true unity.

Reference link:

[1]https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language
[2]https://ai.facebook.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text
[3]https://github.com/pytorch/fairseq/tree/main/examples/data2vec
