Editor: LRS
[New Zhiyuan Introduction] Having just gotten to play painter, can ordinary people now become directors too?
Text-to-image generation models have produced stunning results and are arguably the hottest research area in AI right now, with insiders and outsiders alike enjoying the show.
So if the pictures could move, would the effect be even more cyberpunk?
Recently, a Google paper submitted to ICLR 2023 has caused another stir in the generative model community. Besides making pictures move, the Phenaki model proposed in the paper can also add a plot to the text description, making the video content much richer.
Paper link: https://openreview.net/forum?id=vOEXS39nOF
For example, given the input text:
A photorealistic teddy bear is swimming in the ocean at San Francisco.
The teddy bear goes under water.
The teddy bear keeps swimming under the water with colorful fishes.
A panda bear is swimming under water.
The earlier part is reasonable enough, but when the teddy bear finally turns into a giant panda, it's hard to keep a straight face.
Post a plot twist like this on a short-video platform and it would easily pull in millions of likes, plus a 9.9 on Douban, with 0.1 points deducted only to keep it from getting cocky.
Another example restores the script just as faithfully:
Side view of an astronaut is walking through a puddle on mars
The astronaut is dancing on mars
The astronaut walks his dog on mars
The astronaut and his dog watch fireworks
One person and one dog in outer space: what's going on here?
Compared with text-guided image generation, generating videos directly from text is more difficult: the computational cost is higher, far less high-quality text-video training data is available, and the input videos vary in length.
To solve these problems, Phenaki introduces a new model for learning video representations that compresses a video into a sequence of discrete tokens. The tokenizer uses causal attention in the time dimension so it can handle videos of different lengths, and a bidirectional masked Transformer, conditioned on text embeddings from a pretrained encoder, then generates the video directly.
To address the data problem, the researchers propose a joint training scheme that combines a large text-image corpus with a smaller amount of text-video data, which yields better generalization.
Compared with previous video generation methods, Phenaki supports open-domain text stories whose plot can change over time, and it can generate videos of arbitrary length.
This is also the first work to study video generation from time-variable text prompts, and the video encoder-decoder proposed in the paper outperforms other models in both the spatial and the temporal dimension.
From text to video
Although a video is essentially just a sequence of images, generating a long, coherent video is not easy.
The image domain has no shortage of training data: datasets such as LAION-5B and FFT4B contain billions of text-image pairs, whereas text-video datasets such as WebVid hold only about 10 million videos, far from enough to support open-domain video generation.
In terms of compute, training and running image generation models already squeezes GPUs close to their limits; whether enough headroom can be carved out for a video generation model is another unsolved problem.
Text-guided video generation poses a further difficulty: a short piece of text may describe an image in sufficient detail, but it is far from enough for a long video. Moreover, a video carries context, meaning the next clip has to be generated conditioned on the current one, with the story unfolding gradually over time.
Ideally, a video generation model should be able to produce videos of arbitrary length while conditioning each moment on the frames generated so far and on the current text prompt, which may change from time step to time step.
This ability can clearly distinguish videos from moving images and open up the way for real-world creative applications such as art, design and content creation.
Before this, story-based conditional video generation had never been explored; this is the first paper to move toward that goal.
Learning this ability directly from data is not possible, because no story-based dataset exists to learn from.
To achieve this goal, the researchers designed Phenaki around two components: an encoder-decoder model that compresses videos into discrete embeddings, and a Transformer model that translates text embeddings into video tokens, with the text encoded by the pretrained T5X model.
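To make the division of labor concrete, here is a minimal sketch of that two-stage pipeline in Python. All names (`tokenizer`, `maskgit`, `t5x_encode`) are hypothetical placeholders for the components just described, not Phenaki's actual API.

```python
# Hypothetical sketch of the two-component pipeline described above.
# None of these objects are real Phenaki interfaces; they only illustrate the data flow.

def generate_clip(prompt, tokenizer, maskgit, t5x_encode):
    """Text -> short video clip.

    tokenizer  : C-ViViT-style encoder-decoder (video <-> discrete tokens)
    maskgit    : bidirectional masked Transformer over video tokens
    t5x_encode : frozen pretrained T5X text encoder
    """
    text_emb = t5x_encode(prompt)            # 1. embed the text prompt
    video_tokens = maskgit.sample(text_emb)  # 2. predict discrete video tokens
    return tokenizer.decode(video_tokens)    # 3. decode tokens back into RGB frames
```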
1. Encoder-decoder video model: C-ViViT
The main problem this module must solve is how to obtain a compressed representation of a video. Previous text-to-video work either encodes each frame with an image encoder, which restricts the length of the videos that can be handled, or uses a fixed-length video encoder, which cannot generate variable-length videos.
C-ViViT is a causal variant of ViViT whose architecture is tailored to video generation: it compresses the video in both the spatial and temporal dimensions while remaining autoregressive in time, which allows videos of arbitrary length to be generated autoregressively.
First, the [CLS] tokens are removed from the spatial and temporal Transformers; the temporal Transformer is then applied to all spatial tokens computed by the spatial encoder, instead of the single temporal Transformer that runs over the [CLS] tokens in ViViT.
Most importantly, the ViViT encoder requires a fixed-length video input because it uses all-to-all attention over time. Replacing this with causal attention makes the C-ViViT encoder autoregressive and allows the number of input frames to vary.
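The sketch below illustrates the idea of causal attention over the time axis. It is a simplification for illustration, not the actual C-ViViT code: every patch token may attend only to tokens of its own frame and of earlier frames, so the same module can encode any number of frames.

```python
import torch

def causal_temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask: a token may attend only to tokens of the same or earlier frames."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx[None, :] <= frame_idx[:, None]   # (seq_len, seq_len), True = allowed

# Toy example: 3 frames x 16 patch tokens, 8 attention heads, head dim 64.
q = k = v = torch.randn(1, 8, 3 * 16, 64)
mask = causal_temporal_mask(num_frames=3, tokens_per_frame=16)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 48, 64])
```

Because the mask depends only on each token's frame index, nothing in the module assumes a fixed clip length, which is what makes autoregressive extension to longer videos possible.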
2. Using bidirectional Transformers to generate videos from text
Text-to-video can then be treated as a sequence-to-sequence problem: predict the video tokens that correspond to the input text embeddings.
Most seq-to-seq models use an autoregressive Transformer that predicts image or video tokens one at a time from the encoded text, so sampling time grows linearly with sequence length, which is unacceptable for long videos.
Phenaki instead uses a masked bidirectional Transformer, which cuts sampling down to a small, fixed number of steps regardless of the video sequence length, because a bidirectional Transformer can predict many video tokens at once.
In each training step, a mask ratio is first sampled uniformly from 0 to 1, and a corresponding fraction of tokens (relative to the video length) is randomly replaced with a special [MASK] token. The model parameters are then learned by minimizing the cross-entropy loss over the masked tokens, conditioned on the text embeddings and the unmasked video tokens.
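A minimal PyTorch-style sketch of this training step is shown below; `model` stands for any bidirectional Transformer that returns per-position logits over the video-token vocabulary, and the tensor shapes and `MASK_ID` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def masked_training_step(model, video_tokens, text_emb):
    """One training step: mask a random fraction of tokens, predict them from the rest."""
    batch, seq_len = video_tokens.shape                # integer ids from the video tokenizer
    mask_ratio = torch.rand(())                        # sampled uniformly from [0, 1)
    mask = torch.rand(batch, seq_len) < mask_ratio     # positions to hide
    inputs = video_tokens.masked_fill(mask, MASK_ID)   # replace them with [MASK]

    logits = model(inputs, text_emb)                   # (batch, seq_len, vocab_size)
    # cross-entropy computed only on the masked positions
    return F.cross_entropy(logits[mask], video_tokens[mask])
```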
At inference time, all video tokens are first set to the special [MASK] token. In each inference step, all masked (unknown) video tokens are predicted in parallel, conditioned on the text embeddings and the unmasked (already predicted) video tokens; a fraction of the predictions is kept at each sampling step, and the remaining tokens are re-masked and re-predicted in the next step.
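The loop below sketches this iterative parallel decoding in the spirit of MaskGIT; the cosine schedule, the confidence heuristic, and the batch size of 1 are simplifying assumptions rather than Phenaki's exact procedure.

```python
import math
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def parallel_decode(model, text_emb, seq_len, steps=12):
    tokens = torch.full((1, seq_len), MASK_ID)           # start with everything masked
    unknown = torch.ones(1, seq_len, dtype=torch.bool)   # positions that are still [MASK]

    for step in range(steps):
        logits = model(tokens, text_emb)                 # (1, seq_len, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)
        confidence = confidence.masked_fill(~unknown, float("inf"))  # never re-mask fixed tokens

        # cosine schedule: how many tokens remain masked after this step (0 at the last step)
        still_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        remask = confidence.argsort(dim=-1)[0, :still_masked]        # least confident positions

        tokens = torch.where(unknown, prediction, tokens)            # commit current predictions
        unknown = torch.zeros_like(unknown)
        unknown[0, remask] = True
        tokens = tokens.masked_fill(unknown, MASK_ID)                # re-mask the least confident
    return tokens
```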
For inference and for the autoregressive generation of long videos, classifier-free guidance is used to control how closely the generated video follows the text condition.
Once the first video clip has been generated, subsequent frames can be inferred autoregressively by encoding the last K generated frames of the previous clip with C-ViViT.
MaskGIT is initialized with the tokens computed by the C-ViViT encoder and then generates the remaining video tokens conditioned on the text input.
During this inference process, the text condition can stay the same or change, which lets the model dynamically create visual transitions between the content described by the previous and the current text conditions, effectively generating a visual story told by the input text.
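Put together, long-video generation can be sketched as the loop below, which reuses the hypothetical interfaces from the earlier sketches (`tokenizer.encode`/`tokenizer.decode`, `maskgit.sample`, `t5x_encode`); `K` is the number of frames carried over between segments.

```python
def generate_story(prompts, tokenizer, maskgit, t5x_encode, K=5):
    """Generate a long video from a list of prompts, one segment per prompt."""
    frames = []
    prefix_tokens = None                              # no context before the first segment
    for prompt in prompts:                            # the text condition may change per segment
        text_emb = t5x_encode(prompt)
        # the sampler keeps the prefix tokens fixed and fills in the rest,
        # conditioned on the current text (with classifier-free guidance in the paper)
        video_tokens = maskgit.sample(text_emb, prefix=prefix_tokens)
        clip = tokenizer.decode(video_tokens)         # video tokens -> RGB frames
        frames.extend(clip)
        prefix_tokens = tokenizer.encode(clip[-K:])   # re-encode the last K frames as context
    return frames
```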
Finally, the researchers trained on roughly 15 million text-video pairs, 50 million text-image pairs, plus the 400-million-pair LAION-400M corpus; the resulting Phenaki model has 1.8 billion parameters. Training ran for 1 million steps at a batch size of 512 and took less than 5 days, with 80% of the training data coming from the video dataset.
Qualitative visual evaluation shows that the model has a high degree of control over both the characters and the background dynamics in the video, and that the appearance and style of the video can be adjusted through the text prompt (for example, regular video, cartoon, or pencil drawing).
In quantitative comparisons, Phenaki achieves generation quality comparable to other models under the zero-shot setting.
When examining the impact of training data, there is a performance trade-off between models trained only on video and models trained with a larger share of image data.
Reference:
https://phenaki.video/