Yuyang & Alex, from Aofei Temple
Quantum Bit | Official account QbitAI
A painter holds a brush, dabbing dots onto the canvas and forming the distinctive strokes of a hand-painted work.
Which documentary do you think this is from?
No, no, no! Every frame in this video was generated by AI.
You just tell it "a close-up of a brush on the canvas", and it can produce the picture directly.
It can not only conjure a paintbrush out of thin air; making a horse drink water, the very thing the proverb says you cannot force, is not out of the question either.
Given just a sentence like "a horse drinking water", the AI throws out this picture:
Well, well. At this rate, shooting videos in the future may really come down to nothing more than a few words...
That's right: just as text-to-image AI painting is booming, researchers at Meta AI have given AI generation another super evolution.
This time, you really can "make a video just by talking":
The AI is called Make-A-Video, and it upgrades the static generation of DALL·E and Stable Diffusion directly into motion.
Give it a few words or a few lines of text, and it can generate video footage that exists nowhere in the world, in a remarkably diverse range of styles.
Not only can it handle a documentary style; a full-on sci-fi look is no problem either.
Mix the two styles, and a robot dancing in Times Square doesn't look out of place at all.
A light, artsy animation style? Make-A-Video seems to have that covered too.
After this wave of demos, many netizens were stunned; some comments boiled down to just three letters:
Even big-name researcher LeCun remarked meaningfully: what is to come will always come.
After all, many industry insiders felt that one-sentence video generation was bound to arrive sooner or later. Even so, Meta's move came surprisingly fast:
"It's 9 months faster than I expected."
Some even said: I can't keep up with the speed at which AI is evolving...
A super-evolved text-to-image generation model
You may think Make-A-Video is a video version of DALL·E.
In fact, that's pretty much what it is (tongue firmly in cheek).
We said earlier that Make-A-Video is a super evolution of text-to-image (T2I) models. That's because the first step of the AI's workflow is, in fact, generating an image from text.
From a data standpoint, static image generation models such as DALL·E are trained on paired text-image data.
Although Make-A-Video ultimately produces video, it is not trained on paired text-video data; it still relies on text-image pairs to teach the AI how to render a scene from a description.
Video data is involved too, of course, but mainly as standalone, unlabeled clips that teach the AI how things move in the real world.
In terms of model architecture, Make-A-Video consists mainly of three parts:
- a text-to-image generation model P
- spatiotemporal convolution and attention layers
- a frame interpolation network for raising the frame rate, plus two super-resolution networks for improving image quality
The whole model works like this:
First, an image embedding is generated from the input text.
Then the decoder Dt generates 16 frames of 64×64 RGB images.
The frame interpolation network ↑F interpolates this preliminary result up to the target frame rate.
Next, the first super-resolution network raises the resolution to 256×256, and the second super-resolution network pushes the image quality further, up to 768×768.
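To make the cascade concrete, here is a minimal Python sketch of those five stages. The function names (prior_P, decoder_Dt, interpolate_F, super_resolve) and the 76-frame target are illustrative placeholders inferred from the description above, not Meta's actual code or API.

```python
# Illustrative sketch of the Make-A-Video cascade described above.
# All component names and the 76-frame target are assumptions, not Meta's code.
import numpy as np

def prior_P(text: str) -> np.ndarray:
    """Map input text to an image embedding (stand-in: a 768-d vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

def decoder_Dt(embedding: np.ndarray, frames: int = 16, size: int = 64) -> np.ndarray:
    """Decode the embedding into a low-res clip: 16 frames of 64x64 RGB."""
    return np.zeros((frames, size, size, 3), dtype=np.float32)

def interpolate_F(clip: np.ndarray, target_frames: int = 76) -> np.ndarray:
    """Frame interpolation (↑F): raise the frame count to the target rate."""
    idx = np.linspace(0, len(clip) - 1, target_frames).round().astype(int)
    return clip[idx]  # crude nearest-frame duplication as a placeholder

def super_resolve(clip: np.ndarray, size: int) -> np.ndarray:
    """Super-resolution stage: upscale every frame to size x size (placeholder)."""
    return np.zeros((clip.shape[0], size, size, 3), dtype=np.float32)

def make_a_video(text: str) -> np.ndarray:
    emb = prior_P(text)              # text -> image embedding
    clip = decoder_Dt(emb)           # 16 frames of 64x64 RGB
    clip = interpolate_F(clip)       # interpolate up to the target frame rate
    clip = super_resolve(clip, 256)  # first SR network: 256x256
    clip = super_resolve(clip, 768)  # second SR network: 768x768
    return clip

print(make_a_video("a horse drinking water").shape)  # (76, 768, 768, 3)
```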
Built on this pipeline, Make-A-Video can do more than just generate videos from text. It also has the following capabilities:
Turning a static image into a video:
Generating a video from just a first frame and a last frame:
Generating new videos based on an original video:
Setting a new SOTA for text-to-video generation
In fact, Meta's Make-A-Video is not the first attempt at text-to-video (T2V) generation.
For example, Tsinghua University and BAAI (Zhiyuan) have launched their self-developed "one-sentence video generation" AI, CogVideo, which is currently the only open-source T2V model.
Earlier, GODIVA and Microsoft's NÜWA also managed to generate videos from text descriptions.
This time, however, Make-A-Video markedly improves generation quality.
Experiments on the MSR-VTT dataset show that Make-A-Video sets a new SOTA on both FID (13.17) and CLIPSIM (0.3049).
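CLIPSIM, the second metric, essentially measures how well the generated frames match the prompt: the average CLIP similarity between each frame and the text. Below is a minimal sketch of that idea using the open-source openai/CLIP package; the function name clipsim and the frame-averaging details are assumptions for illustration, not Meta's evaluation code.

```python
# Rough sketch of a CLIPSIM-style score: average cosine similarity between
# CLIP embeddings of each generated frame and the prompt text.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipsim(frames: list, prompt: str) -> float:
    """Average frame-to-prompt CLIP similarity over a list of PIL images."""
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        img_feat = model.encode_image(imgs)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feat.T).mean().item()

# Usage (hypothetical): score = clipsim(list_of_pil_frames, "a horse drinking water")
```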
In addition, the Meta AI team used Imagen's DrawBench prompts for human subjective evaluation.
They invited testers to try Make-A-Video for themselves and judge how well the videos corresponded to the text.
The results show that Make-A-Video outperforms the other two methods in both quality and faithfulness.
One More Thing
Interestingly, with the release of Meta's new AI, a race among T2V models also seems to have kicked off.
Stability AI, the company behind Stable Diffusion, couldn't sit still. Founder and CEO Emad said:
We will release a model better than Make-A-Video, one that everyone can use!
And just a few days ago, a related paper, Phenaki, appeared on the ICLR submission site. Its generated results look like this:
Oh, and although Make-A-Video has not been released publicly yet, Meta AI says it is preparing a demo so everyone can try it out. Interested friends can keep an eye out for it~
Paper address:
https://makeavideo.studio/Make-A-Video.pdf
Reference link:
[1]https://ai.facebook.com/blog/generative-ai-text-to-video/
[2]https://twitter.com/boztank/status/1575541759009964032
[3]https://twitter.com/ylecun/status/1575497338252304384
[4]https://www.theverge.com/2022/9/29/23378210/meta-text-to-video-ai-generation-make-a-video-model-dall-e
[5]https://phenaki.video
— End —
Quantum Bit (QbitAI) · Signed account on Toutiao
Follow us to stay on top of cutting-edge technology