Xiao Xiao from Aofei Temple
QbitAI | Official account QbitAI
How long would it take to manually animate a 3D character through a smooth sequence of actions?
Hand the job over to AI, and you get it just by entering a few sentences (each color represents a different action):
Look down and pick up the golf club, swing the club, jog for a while, then squat down.
Previously, AI-controlled 3D human models could only "do one action at a time" or "complete one instruction at a time"; stringing instructions together continuously was difficult.
Now, with no cutting or stitching needed, you just enter a few commands in order and the 3D character automatically performs the whole set of actions, smoothly and without glitches.
The new AI is called TEACH, and it comes from the Max Planck Institute for Intelligent Systems and Gustave Eiffel University.
Netizens have a lot of ideas:
Does that mean you could drive the animation straight from a script?
The gaming and simulation industries, at least, could clearly make use of it.
So how was this 3D character animation tool built?
An architecture that uses an encoder to "remember" the previous action
TEACH builds on TEMOS, another 3D human motion generation framework the team proposed not long ago.
TEMOS is based on a Transformer architecture and trained on real human motion data.
During training it uses two encoders, a motion encoder (Motion Encoder) and a text encoder (Text Encoder), both feeding the same motion decoder (Motion Decoder) that outputs the motion.
At inference time, however, the motion encoder is "thrown away" and only the text encoder is kept, so the model can take text directly as input and output the corresponding motion.
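For intuition, here is a minimal sketch of that dual-encoder setup, not the authors' code: class and parameter names, dimensions, and the mean-pooling are all assumptions made for illustration.

```python
import torch.nn as nn

class TemosLikeModel(nn.Module):
    """Minimal sketch of a TEMOS-style dual-encoder model (illustrative only, not the official code)."""

    def __init__(self, latent_dim=256, pose_dim=135, text_dim=768, n_heads=4, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        # Two encoders map motion and text into one shared latent space.
        self.motion_proj = nn.Linear(pose_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.motion_encoder = nn.TransformerEncoder(layer, n_layers)
        self.text_encoder = nn.TransformerEncoder(layer, n_layers)
        # One shared decoder turns a latent vector back into a pose sequence.
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.pose_head = nn.Linear(latent_dim, pose_dim)

    def encode_motion(self, poses):             # poses: (B, T, pose_dim); used during training only
        return self.motion_encoder(self.motion_proj(poses)).mean(dim=1)

    def encode_text(self, text_feats):          # text_feats: (B, L, text_dim), e.g. frozen LM features
        return self.text_encoder(self.text_proj(text_feats)).mean(dim=1)

    def decode(self, latent, num_frames):       # latent: (B, latent_dim)
        z = latent.unsqueeze(1).expand(-1, num_frames, -1)
        return self.pose_head(self.decoder(z))  # (B, num_frames, pose_dim)
```

During training, the latent from either encoder is decoded and supervised against the ground-truth motion so that the two embedding spaces line up; at inference only `encode_text` and `decode` are needed, which is exactly why the motion encoder can be discarded.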
Unlike other AIs that map a single text to one deterministic action, TEMOS can generate a variety of different human motions from a single text.
For example, a single instruction such as "a person walks in a circle" or "walks a few steps and then stands still" can produce several different ways of moving:
△ Different ways of turning and different walking paces
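This diversity comes from the latent space being probabilistic rather than a single point: the text encoder produces a distribution, and each sample from it decodes to a different motion. A minimal sketch of that idea follows; the (mu, logvar) head and the `decode_fn` callable are assumptions for illustration, not the official API.

```python
import torch

def sample_motions(mu, logvar, decode_fn, num_samples=3, num_frames=120):
    """Draw several latents from N(mu, sigma) for one sentence and decode each into a motion.

    mu, logvar: (B, latent_dim) outputs of a probabilistic text encoder (assumed).
    decode_fn:  callable mapping (latent, num_frames) -> pose sequence (assumed).
    """
    motions = []
    for _ in range(num_samples):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps    # reparameterized sample from the latent distribution
        motions.append(decode_fn(z, num_frames))  # each sample decodes to a different motion
    return motions
```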
TEACH's architecture follows the TEMOS design, and the motion encoder is carried over from TEMOS unchanged.
But TEACH redesigns the text encoder, adding a component called the Past Encoder, which supplies the context of the previous action each time a new action is generated, improving coherence between actions.
For the first action in a series of instructions, the Past Encoder is disabled; after all, there is no previous action to learn from.
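A minimal sketch of what such past conditioning could look like; module names, dimensions, and the pooling are assumptions, and the official repo linked below has the real implementation.

```python
import torch
import torch.nn as nn

class PastConditionedTextEncoder(nn.Module):
    """Sketch of a TEACH-style text encoder with a 'Past Encoder' branch (illustrative only)."""

    def __init__(self, latent_dim=256, pose_dim=135, text_dim=768, n_heads=4, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.text_proj = nn.Linear(text_dim, latent_dim)   # text tokens -> latent width
        self.past_proj = nn.Linear(pose_dim, latent_dim)   # past-motion frames -> latent width
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_feats, past_motion=None):
        """text_feats: (B, L, text_dim); past_motion: (B, P, pose_dim) or None.

        For the first action in a series there is no previous motion,
        so the past branch is simply skipped.
        """
        tokens = self.text_proj(text_feats)
        if past_motion is not None:
            past_tokens = self.past_proj(past_motion)          # last few frames of the previous action
            tokens = torch.cat([past_tokens, tokens], dim=1)   # prepend past context to the text tokens
        return self.encoder(tokens).mean(dim=1)                # pooled latent for the motion decoder
```

At generation time the segments are produced one after another, each conditioned on the tail of the previous one, e.g. `encoder(text_2, motion_1[:, -5:, :])`; the 5-frame window matches the ablation mentioned later in the article.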
TEACH is trained on BABEL, a 43-hour motion capture dataset that contains transition actions, whole-sequence abstract action labels, and frame-level specific action labels.
During training, BABEL's motion capture sequences are split into many segments, each containing transition actions, so that TEACH learns to generate the transitions as well.
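Here is a sketch of how such segment pairs could be cut out of a frame-labelled mocap stream. This is not the authors' preprocessing code; it only illustrates the idea of keeping the boundary, and therefore the transition, inside each training sample.

```python
def make_segment_pairs(frames, labels):
    """frames: list of per-frame poses; labels: per-frame action label strings.

    Groups consecutive frames sharing a label into segments, then pairs each
    segment with the next one so every training sample spans a transition.
    """
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], frames[start:i]))
            start = i
    return [(label_a, motion_a, label_b, motion_b)
            for (label_a, motion_a), (label_b, motion_b) in zip(segments, segments[1:])]
```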
As for why they did not train on KIT, another common dataset, the authors give their reasoning.
For example, BABEL's verbs are more specific than KIT's, which leans on "vague" words like do/perform.
The researchers then compared TEACH with TEMOS on continuous motion generation.
Better generation results than TEMOS
First, a look at TEACH continuously generating a series of actions without repeating itself:
Then, the researchers compared TEMOS with TEACH.
They trained the TEMOS model in two ways, called Independent and Joint, which differ in the training data.
Independent trains directly on single actions and, at generation time, joins two actions end to end using spherical linear interpolation (Slerp) and similar smoothing; Joint directly takes the action pair as input, with the two language labels separated by a delimiter.
Slerp is an interpolation operation mainly used to interpolate smoothly between two quaternions representing rotations, so that the transition between them looks smoother.
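For reference, a standard Slerp between two rotation quaternions looks like this; it is a generic implementation, not code taken from the paper.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1, with t in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    # Take the shorter arc: flip one quaternion if the dot product is negative.
    if dot < 0.0:
        q1, dot = -q1, -dot
    # For nearly identical rotations, fall back to plain linear interpolation.
    if dot > 0.9995:
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))   # angle between the two rotations
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```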
Take the generation of two consecutive movements as an example.
Independent performs worst: the character simply sits down in place. Joint does better, but the character never raises its left hand. TEACH performs best: after waving the right hand, the character raises the left hand and then lowers it.
Tests on the BABEL dataset show that TEACH has the lowest generation error, with Independent and Joint both trailing behind.
The researchers also measured how many frames of the previous action should be used, and found that transitions were best when the last 5 frames of the previous action were used.
About the authors
Nikos Athanasiou is a graduate student at the Max Planck Institute for Intelligent Systems. His research focuses on multimodal AI, exploring the relationship between human action and language.
Mathis Petrovich is a PhD student at Gustave Eiffel University and also works at the Max Planck Institute for Intelligent Systems. His research focuses on generating realistic and diverse human motion from action labels or text descriptions.
Michael J. Black is a director of the Max Planck Institute for Intelligent Systems; his papers have been cited more than 62,000 times on Google Scholar.
Gül Varol is an assistant professor at Gustave Eiffel University, working on computer vision, video feature learning, and human motion analysis.
TEACH is now open source. Interested readers can follow the links below to try it out~
GitHub Address:
https://github.com/athn-nik/teach
Paper Address:
https://arxiv.org/abs/2209.04066
— End —
QbitAI · Toutiao signed account
Follow us to stay on top of cutting-edge technology