
2025/05/20 00:20:35 technology


Image source https://www.midjourney.com/showcase/

While we are immersed in Douyin and Kuaishou, lying back and snacking, the world is quietly refreshing our understanding. Previously, with AI tools such as DALL-E, MidJourney, and CrAIyon, ordinary users could type in a short piece of text and have artificial intelligence create an artistic illustration from it. Recently, Meta and Google have gone a step further, successively unveiling cutting-edge models that generate video from text and speech.

# Meta

Meta's Make-A-Video can generate not just still images but vivid video content: given text entered by the user depicting a scene, it produces a matching short video.

Sample website: https://make-a-video.github.io/


# Google

In addition to Meta, Google entered the video-generation race at the end of the holiday with two models: Imagen Video and Phenaki. According to Google CEO Sundar Pichai, Imagen Video offers higher resolution than Meta's Make-A-Video and can generate 1280×768 video clips at 24 frames per second.

Sample website: https://imagen.research.google/video/


Phenaki, meanwhile, can generate videos of more than two minutes from text descriptions of roughly 200 words, telling a complete story, comparable to a small-scale director.

Sample website: https://phenaki.video/


What technology is behind all this?

Make-A-Video (Meta)

Make-A-Video improves on existing Text-to-Image technology. The main idea is to learn what the world looks like and how it is described from paired text-image data, and to learn how the world moves, such as camera motion in real recordings, from unsupervised video footage.

First, the authors decompose the full temporal U-Net and attention tensors and approximate them separately in space and time. Second, they design a spatial-temporal pipeline for generating high-resolution, high-frame-rate videos, consisting of a video decoder, a frame-interpolation model, and two super-resolution models, which together support various text-driven generation applications, including Text-to-Video.
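The space-time decomposition above can be sketched as two stacked attention passes: one over the pixels within each frame, then one over the frames at each pixel position. The toy code below (a minimal sketch, not the paper's implementation; projection weights are omitted so q = k = v = x) shows how a full spatiotemporal attention is approximated by this cheaper factorized form:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy self-attention over the token axis of x: (..., tokens, dim).
    # Projection weights are omitted for brevity, so q = k = v = x.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_st_attention(video):
    # video: (frames, height*width, dim) grid of feature tokens
    f, s, d = video.shape
    # 1) spatial attention: each frame attends over its own pixels
    spatial = self_attention(video)                        # (f, s, d)
    # 2) temporal attention: each pixel position attends across frames
    temporal = self_attention(spatial.transpose(1, 0, 2))  # (s, f, d)
    return temporal.transpose(1, 0, 2)                     # back to (f, s, d)

out = factorized_st_attention(np.random.rand(16, 8 * 8, 32))
print(out.shape)  # (16, 64, 32)
```

Factorizing this way turns one attention over f×s tokens into two much smaller attentions, which is what makes extending a pretrained image model to video tractable.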


From source paper: https://arxiv.org/pdf/2209.14792.pdf

Make-A-Video works as follows: the input text x is translated by the prior P into an image embedding, paired with the desired frame rate fps; the decoder D^t generates 16 frames at 64×64 resolution; these are interpolated to a higher frame rate by ↑F, then upscaled to 256×256 by SR_l^t and to 768×768 by SR_h, finally yielding the video ŷ with high spatiotemporal resolution.
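The stage-by-stage flow of this cascade can be followed through tensor shapes alone. The sketch below stubs every model with random output (the interpolation factor of 5 is an illustrative choice, and nearest-neighbour upsampling stands in for the learned super-resolution networks); only the shapes mirror the pipeline described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_P(text_embedding):
    # P: maps the text embedding to an image embedding (random stub).
    return rng.standard_normal(768)

def decoder_Dt(image_embedding, frames=16, size=64):
    # D^t: generates 16 RGB frames at 64x64 (random stub).
    return rng.random((frames, size, size, 3))

def frame_interpolation(video, factor=5):
    # Frame interpolation: the real model predicts in-between frames;
    # here we simply repeat frames (factor 5 is illustrative).
    return np.repeat(video, factor, axis=0)

def super_resolution(video, target):
    # SR_l^t / SR_h: learned super-resolution, stubbed as nearest-neighbour upsampling.
    scale = target // video.shape[1]
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

z = prior_P(rng.standard_normal(512))   # image embedding from input text x
v = decoder_Dt(z)                       # (16, 64, 64, 3)
v = frame_interpolation(v)              # (80, 64, 64, 3)
v = super_resolution(v, 256)            # (80, 256, 256, 3)
v = super_resolution(v, 768)            # (80, 768, 768, 3)
print(v.shape)
```

Each stage only enlarges one axis (time, then space twice), which is why the heavy generation work can happen at a small 16×64×64 scale.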

Imagen Video (Google)

Imagen Video is built on the recently popular diffusion models, directly inheriting from Imagen, Google's SOTA text-to-image model. Beyond its high resolution, it demonstrates three special abilities.

First, it can understand and generate works in different artistic styles, and the 3D structure of an object does not deform during a rotating display. Imagen Video is a cascade of models. The language-model part is Google's own T5-XXL, whose text encoder is frozen after training. The language model is responsible only for encoding text features, leaving the text-to-video conversion to the subsequent video diffusion models. Building on image generation, the base model continuously predicts the next frame in an autoregressive manner, first producing a 48×24 video at 3 frames per second. The flow from text prompt to generated video is shown in the figure below:


From the source paper: https://imagen.research.google/video/paper.pdf
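A cascade of this kind (frozen text encoder → low-resolution base video model → temporal and spatial super-resolution stages) can be sketched with stubbed models. Everything below is illustrative, not Google's implementation: the embedding size, the upsampling factors, and the example prompt are all assumptions, and random tensors stand in for the diffusion models. Only the shape progression reflects the cascade idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt):
    # Frozen language model (T5-XXL in the paper), stubbed as a fixed-size embedding.
    return rng.standard_normal(4096)

def base_video_model(text_emb, frames=16, h=24, w=48):
    # Base diffusion model: a low-resolution, low-frame-rate clip (random stub).
    return rng.random((frames, h, w, 3))

def temporal_sr(video, factor):
    # Temporal super-resolution: fills in extra frames (stub: frame repetition).
    return np.repeat(video, factor, axis=0)

def spatial_sr(video, factor):
    # Spatial super-resolution (stub: nearest-neighbour upsampling).
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

emb = text_encoder("a teddy bear washing dishes")  # hypothetical prompt
clip = base_video_model(emb)     # (16, 24, 48, 3), conceptually 3 fps
clip = temporal_sr(clip, 8)      # (128, 24, 48, 3): 3 fps -> 24 fps
clip = spatial_sr(clip, 4)       # (128, 96, 192, 3)
clip = spatial_sr(clip, 4)       # (128, 384, 768, 3)
print(clip.shape)
```

The design choice worth noting is that the text encoder is frozen: the diffusion stages learn around a fixed text representation rather than fine-tuning the language model itself.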

Phenaki (Google)

Before Phenaki, AI models could generate an ultra-short video from a given prompt, but not a coherent two-minute one. Phenaki turns an imagined storyline into videos of more than two minutes.

The researchers introduced a new causal model for learning video representations, treating a video as a time series of images. The model is based on the Transformer: it decomposes a video into small discrete representations, and this decomposition follows the causal order of time. That is, each prompt is encoded by a spatial Transformer, and a causal Transformer then links the encodings of successive prompts. The flowchart is as follows:


From source paper: https://openreview.net/pdf?id=vOEXS39nOF
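The two-stage encoding above (a spatial Transformer within each frame, then a causal Transformer across frames) can be illustrated with a toy sketch. This is an assumption-laden simplification, not Phenaki's C-ViViT: weights are omitted (q = k = v), mean pooling stands in for learned aggregation, and the tensor sizes are arbitrary. The key property it does demonstrate is causality, since the triangular mask stops any frame from attending to the future:

```python
import numpy as np

def masked_attention(x, mask=None):
    # Toy self-attention over x: (seq, dim); q = k = v = x for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention where mask is False
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

def encode_video(video_tokens):
    # video_tokens: (frames, patches, dim) patch embeddings of a video
    f, p, d = video_tokens.shape
    # 1) spatial transformer: patches within each frame attend to each other
    spatial = np.stack([masked_attention(frame) for frame in video_tokens])
    # 2) causal transformer over time: each frame sees only itself and the past
    frame_repr = spatial.mean(axis=1)                 # (frames, dim) pooled per frame
    causal_mask = np.tril(np.ones((f, f), dtype=bool))
    return masked_attention(frame_repr, causal_mask)

codes = encode_video(np.random.rand(11, 64, 32))
print(codes.shape)  # (11, 32)
```

Because of the causal mask, perturbing a later frame leaves the representations of earlier frames untouched, which is what lets such a model extend a video frame by frame as new prompts arrive.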

The impact of text-to-video generation

As text-to-video technology develops rapidly, the videos on major short-video platforms may in the future no longer be live performances but showcases of synthetic footage, which will have an economic impact on people who make a living editing and recording videos for those platforms.

AI is transforming industry after industry, bringing challenges but also progress. Daniel Jeffries, the new CIO of Stability AI, has said that AI will ultimately create more jobs. Challenges and opportunities always coexist; those who grasp the pulse of the times can build a better future.
