Meta researchers have made a significant leap in AI art generation with Make-A-Video, a new technology that (you guessed it) makes videos from nothing but text prompts. The results are impressive and varied, and, without exception, all slightly unnerving.

We've seen text-to-video models before; they're a natural extension of text-to-image models like DALL-E, which output still images from a prompt. But while the conceptual jump from a still image to a moving one is small for the human mind, it is far from trivial to implement in a machine learning model.

Make-A-Video doesn't actually change the game much on the back end, as the researchers note in the paper describing it: "a model that has only seen text describing images is surprisingly effective at generating short videos."

The AI uses existing, effective diffusion techniques for creating images, which essentially work in reverse from pure visual static, "denoising" toward the target prompt. What's added here is that the model was also given unsupervised training on a large body of unlabeled video content (i.e., it examined the data itself without strong guidance from humans).
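To make the diffusion idea concrete, here is a minimal, purely illustrative sketch in Python. The denoiser and the prompt embedding here are stand-ins invented for illustration, not Meta's actual model or code; the point is only the loop structure, starting from noise and repeatedly denoising toward the prompt.

```python
import numpy as np

def fake_denoiser(noisy_image, prompt_embedding, t):
    """Stand-in for a learned denoising network. A real model would condition
    on the timestep t and the prompt; this toy just nudges the image toward a
    placeholder target derived from the prompt embedding."""
    target = np.tanh(prompt_embedding)             # placeholder for "what the prompt wants"
    return noisy_image + 0.1 * (target - noisy_image)

def generate_image(prompt_embedding, steps=50, shape=(64, 64, 3), seed=0):
    """Reverse diffusion in miniature: start from pure visual static and
    repeatedly denoise toward the prompt over a fixed number of steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                 # start from noise
    for t in reversed(range(steps)):
        x = fake_denoiser(x, prompt_embedding, t)  # each step removes a little noise
    return x

# In a real system the embedding would come from a text encoder;
# here it is just random numbers of a convenient shape.
prompt_embedding = np.random.default_rng(1).standard_normal((64, 64, 3))
image = generate_image(prompt_embedding)
print(image.shape, image.min(), image.max())
```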

From the first it learns how to make a realistic image; from the second it learns what consecutive frames of a video look like. Remarkably, the two can be put together very effectively without the model ever being specifically trained on how to combine them. "In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures," the researchers write.
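As a rough illustration of that division of labor, here is a hypothetical extension of the sketch above: a spatial step handles each frame on its own (the image-model knowledge), while a separate temporal pass nudges neighboring frames toward one another (the video knowledge). This is not Make-A-Video's actual architecture, whose temporal components are learned rather than hand-written smoothing; it only shows how two kinds of knowledge can be layered without joint training.

```python
import numpy as np

def spatial_denoise(frame, prompt_embedding):
    """Per-frame denoising step, standing in for a pretrained image model."""
    target = np.tanh(prompt_embedding)
    return frame + 0.1 * (target - frame)

def temporal_smooth(frames, strength=0.5):
    """Toy temporal layer: pulls each frame toward the average of its
    neighbours so consecutive frames stay coherent."""
    smoothed = frames.copy()
    for i in range(1, len(frames) - 1):
        neighbour_mean = (frames[i - 1] + frames[i + 1]) / 2
        smoothed[i] = (1 - strength) * frames[i] + strength * neighbour_mean
    return smoothed

def generate_clip(prompt_embedding, num_frames=16, steps=50, shape=(64, 64, 3), seed=0):
    """Alternate spatial (per-frame) and temporal (across-frame) updates."""
    rng = np.random.default_rng(seed)
    frames = rng.standard_normal((num_frames, *shape))   # every frame starts as noise
    for _ in range(steps):
        frames = np.stack([spatial_denoise(f, prompt_embedding) for f in frames])
        frames = temporal_smooth(frames)                  # enforce frame-to-frame continuity
    return frames

clip = generate_clip(np.random.default_rng(1).standard_normal((64, 64, 3)))
print(clip.shape)  # (16, 64, 64, 3)
```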

It's hard not to agree. Previous text-to-video systems used a different approach, and while the results were unimpressive, they were promising. Make-A-Video takes things to a new level, achieving fidelity in line with images from the original DALL-E or other previous-generation systems of perhaps 18 months ago.

But it must be said that there is something off about all of them. Not that we should expect photorealism or perfectly natural motion, but the results all have a certain quality that's... well, there's no other word for it: they're a bit nightmarish, aren't they?

There's a dreamlike quality to them that is genuinely unsettling. The motion is strange, as if it were a stop-motion film. The corruption and artifacts give each piece a fuzzy, surreal feeling, as if objects were about to melt away. People blend into one another; the model has no understanding of objects' boundaries, of where things should end or what they should touch.

I don't say this as some kind of AI snob who only wants the highest-definition, most photorealistic imagery. I just find it fascinating that however realistic these videos are in one sense, they are so weird and off-putting in others. That they can be generated quickly and arbitrarily at all is incredible, and they will only get better. But even the best image generators still have that surreal quality that's hard to put a finger on.

Make-A-Video can also turn still images and other videos into variations or extensions of them, much the way image generators can be prompted with images themselves. The results are slightly less disturbing.

This really is a huge improvement over the AI tools that existed before, and the team deserves credit for it. It isn't open to the public yet, but you can sign up now to join the waitlist for whatever form of access they decide on later.