Editors: So Sleepy, Yuan Xie
[New Zhiyuan Introduction] Since the birth of artificial intelligence as a science, making machines "learn like humans" has been the goal of all practitioners. Human intelligence rests on the joint processing of multiple senses and language, and researchers have long been committed to achieving the same effect in machines.
Human intelligence is, at its core, "multimodal learning": it crosses category boundaries to understand and transfer information and experience from different sources and forms.
For example, a person who has watched a tiger documentary on a nature channel and later hears someone describe "a white-browed big cat whose roar howls like the wind" can match that description against what they saw on screen, realize the speaker is describing a tiger, and know better than to run over and try to pet it.
Getting artificial intelligence to achieve the same kind of multimodal learning is high-challenge, high-reward work.
No matter how impressive a standalone algorithm for sound, images, or text may be, if it cannot carry over to data of other modalities, it will never match an algorithm whose single basic framework can handle image recognition, speech recognition, and natural language processing alike.
Meta AI's data2vec algorithm has done exactly that. In its blog post, the research team says that to bring machine learning closer to human intelligence, the barriers between existing self-supervised learning algorithms for different data modalities must be torn down.
Paper link: https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language
Open-source project: https://github.com/pytorch/fairseq/tree/main/examples/data2vec
Yann LeCun also offered his congratulations: "data2vec's results on ImageNet, LibriSpeech (speech recognition) and GLUE (NLP) are all better than the existing SOTAs."
data2vec: spanning CV, NLP, and speech
Today, mainstream artificial intelligence still relies on supervised learning over labeled data. Supervised learning excels at training specialized models, which often reach extremely high performance on the tasks they are trained for.
However, AI that leans on these "crutches" fails easily in domains where labeled data is scarce, and it is asking rather a lot of scientists to painstakingly build such "crutches" for AI.
For example, researchers in various countries have put enormous effort into building large-scale labeled datasets for their own languages' speech and text, but doing this for the thousands of languages spoken on Earth is impossible.
This is where "self-supervised learning" comes in.
Self-supervision lets computers learn about the world by observing the structure of images, speech, or text on their own, without needing labeled images, text, or audio. But today's self-supervised learning algorithms differ greatly in how they learn from images, speech, text, and other modalities.
Each algorithm predicts different units for its modality: pixels or visual tokens for images, words for text, and learned inventories of sound units for speech.
A set of pixels is very different from an audio waveform or a passage of text. Because of this, algorithm design has always been tied to a specific modality, meaning that each modality gets its own way of operating.
This difference has been a major obstacle to applying self-supervised learning at larger scale: a powerful algorithm designed to understand images cannot simply be pointed at another modality such as text, so it is hard to push several modalities forward at the same pace.
data2vec is the first high-performance self-supervised algorithm that works across multiple modalities, applicable to speech, images, and text alike. It outperforms the previous best single-purpose algorithms for computer vision and speech, and is competitive on NLP tasks.
data2vec also represents a new, unified self-supervised learning paradigm: it improves performance across multiple modalities without relying on contrastive learning or on reconstructing the input.
Regardless of modality, data2vec trains models to predict their own representations of the input data.
By focusing on these representations rather than on predicting visual tokens, words, or sounds, a single algorithm can handle completely different kinds of input, removing the dependence on modality-specific targets in the learning task.
Before the representations can be predicted, however, the target features must be normalized in a way that stays robust across different modalities.
data2vec first uses a teacher network to compute target representations from an image, a piece of text, or a speech utterance. Part of the input is then masked, the process is repeated with a student network, and the student is asked to predict the teacher's latent representations.
The student model must predict representations of the full input even though it sees only part of the information.
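Roughly, this training step can be sketched in PyTorch as below. This is only a schematic reading of the scheme described above; the encoder interface, the EMA decay, the number of averaged teacher layers, and the smooth-L1 loss are assumptions for illustration, not the exact fairseq implementation.

```python
import copy
import torch
import torch.nn.functional as F
from torch import nn


class Data2VecStyleSketch(nn.Module):
    """Schematic teacher-student setup: the student sees a masked input and
    regresses the teacher's layer-averaged latent representations.
    `encoder` is assumed to be a Transformer that can return all block
    outputs and accept a boolean mask -- a hypothetical interface."""

    def __init__(self, encoder: nn.Module, ema_decay: float = 0.999, top_k: int = 8):
        super().__init__()
        self.student = encoder                    # updated by gradient descent
        self.teacher = copy.deepcopy(encoder)     # updated only via EMA
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        self.top_k = top_k                        # how many top teacher blocks to average

    @torch.no_grad()
    def ema_update(self):
        # teacher weights track an exponential moving average of the student
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    @torch.no_grad()
    def build_targets(self, x):
        # teacher sees the *unmasked* input; the target is the normalized
        # average of its top-k block outputs, one vector per time step
        layers = self.teacher(x, return_all_layers=True)   # list of (B, T, D)
        target = torch.stack(layers[-self.top_k:]).mean(dim=0)
        return F.layer_norm(target, target.shape[-1:])

    def forward(self, x, mask):
        # mask: boolean (B, T) tensor marking the positions the student must infer
        target = self.build_targets(x)
        student_out = self.student(x, mask=mask)            # (B, T, D)
        # regression loss only on the masked positions
        return F.smooth_l1_loss(student_out[mask], target[mask])
```

In training, one would call `ema_update()` after each optimizer step so the teacher lags slightly behind the student.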
SOTA triple
Computer vision
The authors pretrained data2vec on images from the ImageNet-1K training set and fine-tuned the resulting image classification model with labeled data from the same benchmark.
For downstream tasks that require predicting a single label per image, they do this by mean-pooling the output representations and stacking a softmax-normalized classifier on top.
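A minimal sketch of such a classification head, assuming a pretrained encoder that returns one vector per image patch (the dimensions and encoder interface here are illustrative):

```python
import torch
from torch import nn


class MeanPoolClassifier(nn.Module):
    """Sketch: mean-pool the encoder's patch representations, then apply a
    softmax-normalized linear classifier (via cross-entropy at training time)."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.encoder = encoder                    # pretrained data2vec-style ViT (assumed)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, images):                    # images: (B, 3, H, W)
        tokens = self.encoder(images)             # assumed shape (B, num_patches, embed_dim)
        pooled = tokens.mean(dim=1)               # mean pooling over all patch tokens
        return self.classifier(pooled)            # logits; softmax is folded into the loss


# fine-tuning then uses the ordinary cross-entropy over ImageNet-1K labels:
# loss = nn.functional.cross_entropy(model(images), labels)
```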
The results show that data2vec surpasses prior work with both ViT-B and ViT-L. Predicting contextualized latent representations in a masked-prediction setting performs very well compared with methods that predict local targets such as raw input pixels, engineered image features, or visual tokens.
data2vec also beats the current SOTA self-distillation methods.
Speech Processing
The team pretrained data2vec on 960 hours of speech audio from LibriSpeech (LS-960), a dataset of relatively clean audio from English audiobooks.
To gauge performance under different resource conditions, the authors fine-tuned the automatic speech recognition model with varying amounts of labeled data, from as little as 10 minutes up to 960 hours.
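Such ASR fine-tuning typically amounts to putting a CTC head on top of the pretrained encoder; the sketch below assumes an encoder that maps a batch of waveforms to frame-level features and uses an illustrative character vocabulary, so it is not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CTCFineTuner(nn.Module):
    """Sketch: a linear CTC head over a pretrained speech encoder (assumed interface)."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, vocab_size: int = 32):
        super().__init__()
        self.encoder = encoder                    # pretrained data2vec-style audio encoder
        self.ctc_head = nn.Linear(embed_dim, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, waveforms, transcripts, transcript_lengths):
        feats = self.encoder(waveforms)           # assumed (B, T, embed_dim) frame features
        log_probs = F.log_softmax(self.ctc_head(feats), dim=-1)
        # nn.CTCLoss expects (T, B, vocab); assume no padding, so every
        # utterance in the batch contributes all T frames
        input_lengths = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
        return self.ctc_loss(log_probs.transpose(0, 1), transcripts,
                             input_lengths, transcript_lengths)
```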
Compared with wav2vec 2.0 and HuBERT, two speech representation learning algorithms that rely on discrete speech units, data2vec improves in every labeled-data setting, with the largest gain at 10 minutes of labeled data (a 20% relative improvement in word error rate).
This shows that when the learning targets are rich and contextualized, learning contextualized targets during pretraining improves performance without any need to learn discrete units.
Natural Language Processing
data2vec uses the same training setup as BERT, pretraining on the Books Corpus and English Wikipedia for 1 million updates with a batch size of 256 sequences.
The team evaluated on the General Language Understanding Evaluation (GLUE) benchmark, which includes natural language inference (MNLI, QNLI, RTE), sentence similarity (MRPC, QQP, STS-B), grammaticality (CoLA), and sentiment analysis (SST-2).
The authors fine-tuned data2vec on the labeled data provided by each task. The results show that data2vec outperforms the RoBERTa baseline.
data2vec is the first successful pretrained NLP model that does not use discrete units (words, subwords, characters, or bytes) as training targets; instead, it predicts contextualized latent representations that emerge from self-attention over the entire unmasked text sequence.
This lets the learning task predict targets with properties specific to the current text sequence, rather than representations that are generic to every sequence in which a particular discrete unit appears.
Moreover, the set of training targets is not a closed vocabulary, so the model can itself define the kinds of targets it finds appropriate.
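The contrast with BERT-style masked language modeling can be made concrete by comparing the two losses side by side (the tensor contents below are random placeholders purely for illustration):

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 16, 768, 30522        # batch, sequence length, hidden dim, vocab size
mask = torch.rand(B, T) < 0.15        # positions the model must predict
mask[:, 0] = True                     # make sure at least one position is masked

# BERT-style masked language modeling: classify each masked position
# against a fixed, closed vocabulary of discrete token ids.
token_logits = torch.randn(B, T, V)
token_ids = torch.randint(0, V, (B, T))
mlm_loss = F.cross_entropy(token_logits[mask], token_ids[mask])

# data2vec-style objective: regress each masked position onto the teacher's
# contextualized latent vector for that position; no vocabulary is involved.
student_states = torch.randn(B, T, D)
teacher_targets = torch.randn(B, T, D)
latent_loss = F.smooth_l1_loss(student_states[mask], teacher_targets[mask])
```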
Self-supervised learning: the road ahead
Compared with Google's 2021 attempts at similar goals, the Perceiver released in July and Pathways announced in October, data2vec has its advantages: Pathways was an industry PR exercise without concrete details or a paper, while Perceiver still rests on the traditional path of labeled data and supervised learning.
In summarizing the work, the Meta AI research group said that data2vec opens up many possibilities, allowing AI to learn skills that were previously too complex for machines, such as different ways to bake bread or different techniques for playing football, by combining video, audio recordings, and written articles.
Some of these skills, like speech recognition for every language on Earth, are simply too expensive to teach AI with labeled data. In the future, AI should learn with a common architecture across data modalities, building experience that transcends any single modality, so that it can learn from one example and apply it to other tasks. data2vec brings that goal a step closer.
The research team also noted: "The latent representations processed in our experiments are not hybrid encodings of the three modalities; we still process single-modality data, one modality at a time. The main innovation of this project, however, is that data2vec handles different modalities in essentially the same way. That is something no one has done before, and it moves closer to the human audio-visual learning process described by neurobiologists."
Still, data2vec's multimodal general-purpose neural network is not without shortcomings: it depends on modality markers in the data. Images, speech, and text must be preprocessed so that their modality is identified, and those cues are fed to data2vec through what the paper describes as small modality-specific input encoders. Real human intelligence needs no such preprocessing to sort out "this came from a text, that was a spoken message from your second uncle."
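What those small modality-specific input encoders amount to can be sketched as follows: each modality gets its own thin front end, while the Transformer behind it is shared. The specific layer choices and vocabulary size here are illustrative, not the paper's exact architecture.

```python
import torch
from torch import nn


class SharedBackboneSketch(nn.Module):
    """Sketch: per-modality input encoders feeding one shared Transformer.
    The caller still has to say which modality the input belongs to."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.input_encoders = nn.ModuleDict({
            # images: ViT-style 16x16 patch embedding
            "image": nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),
            # text: token embedding over a subword vocabulary (size assumed)
            "text": nn.Embedding(50265, embed_dim),
            # speech: strided 1-D convolution over the raw waveform
            "audio": nn.Conv1d(1, embed_dim, kernel_size=10, stride=5),
        })
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)  # shared across modalities

    def forward(self, x, modality: str):
        h = self.input_encoders[modality](x)
        if modality == "image":
            h = h.flatten(2).transpose(1, 2)      # (B, D, H', W') -> (B, patches, D)
        elif modality == "audio":
            h = h.transpose(1, 2)                 # (B, D, T')     -> (B, T', D)
        return self.backbone(h)                   # same Transformer for every modality
```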
About the authors
Wei-Ning Hsu (Xu Weining) is a senior research scientist in the Meta AI research group. He received his PhD from MIT; his research focuses on representation learning, self-supervised learning, and speech recognition.
Jiatao Gu (Gu Jiatao) is a research scientist in the Meta AI research group. He holds a PhD in electronic engineering from the University of Hong Kong; his research focuses on natural language processing and deep learning.
Qiantong Xu is a senior research engineer in the Meta AI research group, working on acoustic modeling and speech recognition.
References:
https://ai.facebook.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text/
https://www.zdnet.com/article/metas-data2vec-is-the-next-step-toward-one-neural-network-to-rule-them-all/