Recording videos with cartoon avatars is a popular pastime on social networks and short-video platforms, but it has its problems: the range of avatar adjustment is fairly narrow, and the avatar often does not look much like the user.

2025/05/08 14:08:35

The Heart of the Machine Report

Editor: Zhang Qian

This framework converts portrait videos into animation, with high-resolution and highly controllable output.


Recently, a study from Nanyang Technological University in Singapore received thousands of likes on Reddit and Twitter. The researchers developed VToonify, a framework for controllable high-resolution portrait video style conversion that performs well in flexibility of style control, quality of the generated video, and temporal coherence.

You can flexibly adjust the style type, the degree of cartoonization, and other parameters as needed.

The demo shows that portraits generated by VToonify not only have a highly adjustable cartoon style but also retain many details of the original portrait, so each person gets a distinct look. Many netizens therefore remarked that this tool should make producing animated films much easier.

Others want to apply it to VR.

Asked whether it could serve as a real-time filter, the author said that the model is still very large and that reaching real-time performance would take some engineering effort.

Paper overview

Paper link: https://arxiv.org/pdf/2209.11224.pdf
Project link: https://github.com/williamyang1991/VToonify
Demo link: https://huggingface.co/spaces/PKUWilliamYang/VToonify
Colab link: https://colab.research.google.com/github/williamyang1991/VToonify/blob/master/notebooks/inference_playground.ipynb

Generating high-quality artistic portrait videos is an important task in computer graphics and computer vision. Although researchers have proposed a series of successful portrait cartoonization models built on the powerful StyleGAN, these image-oriented methods have obvious limitations when applied to video, such as fixed frame size, face alignment requirements, missing non-facial details, and temporal inconsistency.

That is, an efficient way to cartoonize videos requires overcoming the following challenges:

It must handle unaligned faces and varying video sizes to keep motion natural. A larger frame or a wider angle captures more information and keeps the face from moving out of frame;
To match today's widely used high-definition devices, the generated video must have sufficiently high resolution;
To support a practical interactive system, the method should offer flexible style control so that users can adjust and choose the style they prefer.

To meet these needs, the researchers propose VToonify, a hybrid framework designed specifically for video cartoonization.

Specifically, they first analyzed the translation equivariance of StyleGAN, which is the key to overcoming the "fixed frame size" limitation. As shown in Figure 2(c), VToonify combines the advantages of the StyleGAN-based framework and the image translation framework to achieve controllable high-resolution portrait video style conversion.

They use the StyleGAN architecture of [Pinkney and Adler 2020] for high-resolution video style conversion, but adapt it by removing the fixed-size input feature and the low-resolution layers. The result is a new fully convolutional encoder-generator architecture, similar to that of image translation frameworks, that supports different video sizes.
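The role of convolution in lifting the fixed-size restriction can be illustrated with a toy example. The following is a minimal pure-Python sketch, not the authors' code: `conv2d_valid`, `shift_right`, `KERNEL`, and the test frames are all hypothetical names chosen for illustration. It shows that a convolution accepts frames of any size, and that shifting the input and then convolving gives the same result as convolving and then shifting.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, k):
    """Shift every row k pixels to the right, filling the gap with zeros."""
    return [[0] * k + row[:len(row) - k] for row in image]


def blank_frame(height, width):
    """Create an all-zero frame of the requested size."""
    return [[0] * width for _ in range(height)]


# A toy 3x3 kernel standing in for a learned convolution layer (assumption).
KERNEL = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]

# Two frames of different sizes; the small one has content away from its borders.
small = blank_frame(6, 8)
small[2][3], small[3][4] = 1, 2
large = blank_frame(10, 12)

# The same layer handles both frames; the output size follows the input size.
out_small = conv2d_valid(small, KERNEL)
out_large = conv2d_valid(large, KERNEL)
print(len(out_small), len(out_small[0]))
print(len(out_large), len(out_large[0]))

# Translation equivariance: shift-then-convolve equals convolve-then-shift.
shifted_then_conv = conv2d_valid(shift_right(small, 2), KERNEL)
conv_then_shifted = shift_right(out_small, 2)
print(shifted_then_conv == conv_then_shifted)
```

A network with a fixed-size learned input, by contrast, always starts from the same grid, which is why the article describes removing it as the step that allows arbitrary video sizes.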

In addition to the original high-level style code, they also train the encoder to extract multi-scale content features of the input frame as additional content conditions for the generator in order to better preserve the critical visual information of the frame during the style conversion process.

Following [Chen et al. 2019; Viazovetskyi et al. 2020], they distill StyleGAN using synthetic paired data.

In addition, they propose a flicker suppression loss, based on simulating camera motion over a single synthetic data sample, to eliminate flicker.
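The article does not give the formula for this loss, so the following is only one plausible reading, sketched in pure Python under stated assumptions: camera motion is simulated as a small horizontal shift of a single frame, and the loss penalizes any difference between "stylize the shifted frame" and "shift the stylized frame". The names `shift`, `flicker_loss`, `consistent_stylizer`, and `flickering_stylizer` are hypothetical and do not come from the paper or its repository.

```python
def shift(frame, dx):
    """Simulate a small horizontal camera pan of dx pixels (zero fill)."""
    return [[0.0] * dx + row[:len(row) - dx] for row in frame]


def flicker_loss(stylize, frame, dx):
    """Mean squared difference between stylize(shift(x)) and shift(stylize(x)).

    Only columns >= dx are compared, where both results hold real content.
    """
    a = stylize(shift(frame, dx))
    b = shift(stylize(frame), dx)
    diffs = [
        (pa - pb) ** 2
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a[dx:], row_b[dx:])
    ]
    return sum(diffs) / len(diffs)


def consistent_stylizer(frame):
    """Toy stylizer that depends only on pixel values, not on position."""
    return [[min(1.0, 1.5 * p) for p in row] for row in frame]


def flickering_stylizer(frame):
    """Toy stylizer whose output depends on pixel position, causing flicker."""
    return [[p + 0.1 * (j % 2) for j, p in enumerate(row)] for row in frame]


# A single synthetic frame (assumption: grayscale values in [0, 1]).
frame = [[0.1 * ((i + j) % 5) for j in range(8)] for i in range(6)]

loss_consistent = flicker_loss(consistent_stylizer, frame, dx=1)
loss_flickering = flicker_loss(flickering_stylizer, frame, dx=1)
print(loss_consistent, loss_flickering)
```

In this toy setting the position-independent stylizer incurs no penalty while the position-dependent one does, which is the behavior a flicker suppression term is meant to encourage. No second video frame or optical flow is needed, matching the article's point that the loss works from single synthetic samples.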

Therefore, VToonify can learn fast and coherent video conversion without real data, complex video synthesis and explicit optical flow calculations.

Unlike the standard image translation frameworks in [Chen et al. 2019; Viazovetskyi et al. 2020], VToonify incorporates the StyleGAN model into the generator, distilling both the data and the model. VToonify therefore inherits StyleGAN's flexibility in style adjustment. By reusing StyleGAN as the generator, the researchers only need to train the encoder, which greatly reduces training time and difficulty.

Following this approach, the researchers proposed two VToonify variants for portrait video cartoonization, based on two representative StyleGAN backbones: Toonify [Pinkney and Adler 2020] and DualStyleGAN [Yang et al. 2022].

The former stylizes the face according to the overall style of the dataset, while the latter uses an image from the dataset to specify a finer style, as shown in the upper right corner of Figure 1.

By adapting DualStyleGAN's style control module [Yang et al. 2022] to adjust the encoder features, and by carefully designing the data generation and training objectives, VToonify inherits DualStyleGAN's flexible style control and style degree adjustment and extends these capabilities to video (as shown in the upper right corner of Figure 1).

Collection-based portrait video style conversion

In the collection-based portrait video style conversion, the researchers used the representative Toonify as the backbone, which uses the original StyleGAN architecture and is conditioned only on the style code.

As shown in Figure 4, the collection-based VToonify framework consists of an encoder and a generator built on Toonify. The encoder takes a video frame and produces content features, which are fed to the generator to produce the final stylized portrait. Unlike existing StyleGAN-based frameworks that use the entire StyleGAN architecture, the generator is built from only the highest 11 layers of StyleGAN. As analyzed in [Karras et al. 2019], the low-resolution and high-resolution layers of StyleGAN mainly capture structure-related styles and color/texture styles, respectively. Therefore, the generator's main task is to upsample the content features and render stylized colors and textures for them.

Exemplar-based portrait video style conversion

In exemplar-based portrait video style conversion, the researchers use DualStyleGAN as the backbone. DualStyleGAN adds an extrinsic style path to StyleGAN and is conditioned on an intrinsic style code, an extrinsic style code, and style degrees. The intrinsic style code describes the characteristics of the face, while the extrinsic style code describes the structure and color style of the artistic portrait. The structure style degree and the color style degree determine the intensity of the applied style.

The exemplar-based framework has much in common with the collection-based framework described above. It achieves flexible style control through two modifications: a Modified ModRes for structure style control, and an added Style-Degree-Aware fusion module. The complete architecture is shown in Figure 9.
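To make the idea of separate structure and color style degrees concrete, here is a deliberately simplified sketch. It is an assumption-laden illustration, not the Style-Degree-Aware fusion module itself: features are plain lists of numbers, and each degree is treated as a linear blend weight between the input's features and the exemplar's. `StyleControls`, `lerp`, and `apply_style` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StyleControls:
    """User-facing knobs; 0.0 keeps the input, 1.0 applies the exemplar fully."""
    structure_degree: float
    color_degree: float

    def __post_init__(self):
        for name in ("structure_degree", "color_degree"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def lerp(a, b, t):
    """Blend two equal-length feature vectors with weight t on b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def apply_style(content, exemplar, controls):
    """Blend structure and color features independently (toy stand-in)."""
    return {
        "structure": lerp(content["structure"], exemplar["structure"],
                          controls.structure_degree),
        "color": lerp(content["color"], exemplar["color"],
                      controls.color_degree),
    }


# Toy feature vectors for the input face and the style exemplar (assumptions).
content = {"structure": [0.0, 0.0], "color": [0.0, 1.0]}
exemplar = {"structure": [1.0, 1.0], "color": [1.0, 0.0]}

# Keep the input's structure but take on half of the exemplar's color style.
result = apply_style(content, exemplar, StyleControls(0.0, 0.5))
print(result)
```

Keeping the two degrees as independent parameters mirrors the article's description: a user could, for example, keep facial structure close to the input while still adopting the exemplar's colors.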

Experimental results

Experimental results show that the stylized frames generated by VToonify not only match the quality of the backbone's output, but also better preserve the details of the input frame.

For more details, please refer to the original paper.
