Recording videos with cartoon avatars is a popular pastime on social networks and short-video platforms, but it has drawbacks: the avatars offer only a narrow range of adjustment and often do not look much like the user.

2025/05/08 14:08:35

The Heart of the Machine Report

Editor: Zhang Qian

This framework converts portrait videos into animations at high resolution and with a high degree of control.

Many people enjoy recording videos with cartoon avatars on social networks and short-video platforms. The approach has its problems, though: the avatars can only be adjusted within a narrow range, and the result often bears little resemblance to the person being filmed.

Recently, a study from Nanyang Technological University in Singapore received thousands of likes on Reddit and Twitter. The researchers developed VToonify, a framework for controllable high-resolution portrait video style transfer that performs well in style-control flexibility, generated video quality, and temporal coherence.


Users can flexibly adjust the style type, the degree of cartoonization, and other settings to suit their needs.


The demo shows that the portraits VToonify generates not only have a highly adjustable cartoon style but also retain many details of the original portrait, so each person's result looks distinct. Many netizens remarked that with this tool, making animated movies would become easy.


Some also want to apply it to VR.


When asked whether it can be used as a real-time filter, the author said that the model is currently still very large and that reaching real-time performance will require some engineering effort.


Paper overview


  • Paper link: https://arxiv.org/pdf/2209.11224.pdf
  • Project link: https://github.com/williamyang1991/VToonify
  • Demo link: https://huggingface.co/spaces/PKUWilliamYang/VToonify
  • Colab link: https://colab.research.google.com/github/williamyang1991/VToonify/blob/master/notebooks/inference_playground.ipynb

Generating high-quality artistic portrait videos is an important task in computer graphics and computer vision. Although researchers have proposed a series of successful portrait cartoonization models built on the powerful StyleGAN, these image-oriented methods have obvious limitations when applied to videos, such as a fixed frame size, the need for face alignment, missing non-facial details, and temporal inconsistency.

That is, an effective video cartoonization method must overcome the following challenges:

  • It must handle unaligned faces and different video sizes so that motion stays natural. A larger video size or a wider angle captures more information and keeps faces from moving out of the frame;
  • To match today's widely used high-definition devices, the generated videos must have a sufficiently high resolution;
  • To support a practical user interaction system, the method should provide flexible style control, allowing users to adjust and choose the style they prefer.

To meet these needs, the researchers proposed VToonify, a hybrid framework designed specifically for video cartoonization.

Specifically, they first analyzed the translation equivariance of StyleGAN, which is the key to overcoming the fixed frame size limitation. As shown in Figure 2(c), VToonify combines the advantages of StyleGAN-based frameworks and image translation frameworks to achieve controllable high-resolution portrait video style transfer.


They adopt the StyleGAN architecture of [Pinkney and Adler 2020] for high-resolution video style transfer, but adapt it by removing the fixed-size input feature and the low-resolution layers. The result is a new fully convolutional encoder-generator architecture, similar to those used in image translation frameworks, that supports different video sizes.
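Why a fully convolutional design copes with different frame sizes can be illustrated with a small sketch. The code below is not taken from VToonify; it is a plain-Python toy in which the helper names, the kernel values, and the test pattern are all assumptions made for illustration. It applies one set of convolution weights to frames of two sizes and then checks that shifting the input simply shifts the output, which is the translation equivariance property.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def make_frame(height, width):
    """Hypothetical test frame filled with a simple deterministic pattern."""
    return [[(3 * y + 5 * x) % 7 for x in range(width)] for y in range(height)]


# Hypothetical 3x3 kernel; the same weights are reused for every frame size.
KERNEL = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

small = conv2d_valid(make_frame(6, 8), KERNEL)    # 6x8 frame   -> 4x6 output
large = conv2d_valid(make_frame(10, 12), KERNEL)  # 10x12 frame -> 8x10 output

# Translation equivariance: dropping the first column of the frame and then
# convolving gives the same result as convolving first and dropping the
# first column of the output.
frame = make_frame(10, 12)
shifted_input = [row[1:] for row in frame]
shift_then_conv = conv2d_valid(shifted_input, KERNEL)
conv_then_shift = [row[1:] for row in conv2d_valid(frame, KERNEL)]
equivariant = shift_then_conv == conv_then_shift
```

Nothing in `conv2d_valid` depends on a fixed input size, so the output simply grows with the frame. A layer that starts from a fixed-size learned input would not have this property.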

In addition to the original high-level style code, they train the encoder to extract multi-scale content features from the input frame as an additional content condition for the generator, so that the key visual information of the frame is better preserved during style transfer.

They follow the practice of [Chen et al. 2019; Viazovetskyi et al. 2020] and distill StyleGAN on synthetic paired data.

In addition, they propose a flicker suppression loss, based on simulating camera motion over a single synthetic frame, to eliminate flicker.

As a result, VToonify can learn fast and coherent video translation without real data, complex video synthesis, or explicit optical flow computation.
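The idea of suppressing flicker by simulating camera motion on a single frame can be sketched as follows. The article does not give the paper's exact loss, so this is only an illustration under assumptions: camera motion is simulated as shifted crops of one frame, the penalty is a mean squared difference, and every function name and toy model here is hypothetical.

```python
def crop(frame, top, left, height, width):
    """Cut a height x width window out of a frame (a list of rows)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def mean_squared_difference(a, b):
    """Average squared per-pixel difference between two equally sized frames."""
    total = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def flicker_penalty(stylize, frame, offsets, height, width):
    """Compare 'move the camera, then stylize' with 'stylize, then move'.

    Each (top, left) offset simulates a small camera shift as a crop of the
    same frame. A temporally stable model gives the same pixels either way.
    """
    stylized_full = stylize(frame)
    penalties = []
    for top, left in offsets:
        moved_then_stylized = stylize(crop(frame, top, left, height, width))
        stylized_then_moved = crop(stylized_full, top, left, height, width)
        penalties.append(
            mean_squared_difference(moved_then_stylized, stylized_then_moved))
    return sum(penalties) / len(penalties)


def stable_model(frame):
    """Toy stylizer that treats every pixel identically, so it cannot flicker."""
    return [[2 * p + 1 for p in row] for row in frame]


def flickering_model(frame):
    """Toy stylizer whose output depends on where a pixel sits in the window."""
    return [[p + (x + y) % 2 for x, p in enumerate(row)]
            for y, row in enumerate(frame)]


frame = [[(y * 4 + x) % 5 for x in range(8)] for y in range(8)]
offsets = [(0, 1), (1, 0), (1, 2)]  # hypothetical simulated camera shifts

stable_loss = flicker_penalty(stable_model, frame, offsets, 5, 5)
flicker_loss = flicker_penalty(flickering_model, frame, offsets, 5, 5)
```

The stable toy model incurs no penalty, while the position-dependent one is penalized for every simulated shift. No second video frame and no optical flow estimate is needed to compute the penalty.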

Unlike the standard image translation frameworks in [Chen et al. 2019; Viazovetskyi et al. 2020], VToonify incorporates the StyleGAN model into the generator, so that both the data and the model are distilled. VToonify therefore inherits StyleGAN's flexibility in style adjustment. By reusing StyleGAN as the generator, the researchers only need to train the encoder, which greatly reduces training time and difficulty.
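The training setup described here, a reused generator with only the encoder being learned from synthetic pairs, can be shown with a deliberately tiny toy. It is a sketch under heavy assumptions: the "encoder" and "generator" are single scalar weights, the teacher is a made-up linear mapping, and the constants and function names are hypothetical rather than anything from the VToonify code.

```python
import random

GENERATOR_WEIGHT = 3.0  # frozen; stands in for the pretrained StyleGAN layers
TEACHER_SCALE = 6.0     # hypothetical teacher mapping: target = 6 * input


def generator(feature):
    """Frozen generator: its weight is never updated during training."""
    return GENERATOR_WEIGHT * feature


def make_synthetic_pairs(count, seed=0):
    """Synthetic (input, target) pairs produced by the teacher, no real data."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(count)]
    return [(x, TEACHER_SCALE * x) for x in inputs]


def train_encoder(pairs, steps=200, learning_rate=0.05):
    """Fit the single encoder weight by gradient descent on mean squared error."""
    encoder_weight = 0.0
    for _ in range(steps):
        gradient = 0.0
        for x, target in pairs:
            prediction = generator(encoder_weight * x)
            # Derivative of (prediction - target)^2 w.r.t. encoder_weight.
            gradient += 2.0 * (prediction - target) * GENERATOR_WEIGHT * x
        encoder_weight -= learning_rate * gradient / len(pairs)
    return encoder_weight


encoder_weight = train_encoder(make_synthetic_pairs(32))
```

Only `encoder_weight` changes during training. Because the generator already accounts for a factor of 3, the encoder has to learn only the remaining factor of about 2, which loosely mirrors why reusing a pretrained generator shortens training.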

Following this approach, the researchers proposed two VToonify variants for portrait video cartoonization, built on two representative StyleGAN backbones: Toonify [Pinkney and Adler 2020] and DualStyleGAN [Yang et al. 2022].

The former stylizes the face according to the overall style of the dataset, while the latter uses an image from the dataset to specify a finer style, as shown in the upper right corner of Figure 1.


The researchers adapt DualStyleGAN's style control modules [Yang et al. 2022] to adjust the features of the encoder, and carefully design the data generation and training objectives. VToonify thus inherits DualStyleGAN's flexible style control and style degree adjustment, and extends these capabilities to videos (as shown in the upper right corner of Figure 1).

Collection-based portrait video style transfer

For collection-based portrait video style transfer, the researchers use the representative Toonify as the backbone. Toonify uses the original StyleGAN architecture and is conditioned only on the style code.

As shown in Figure 4, the collection-based VToonify framework contains an encoder and a generator built on Toonify. The encoder accepts video frames and produces content features, which are then fed into the generator to produce the final stylized portrait. Unlike existing StyleGAN-based frameworks that use the entire StyleGAN architecture, the generator is built from only the highest 11 layers of StyleGAN. As analyzed in [Karras et al. 2019], the low-resolution and high-resolution layers of StyleGAN mainly capture structure-related styles and color/texture styles, respectively. The generator's main task is therefore to upsample the content features and render stylized colors and textures for them.


Exemplar-based portrait video style transfer

For exemplar-based portrait video style transfer, the researchers use DualStyleGAN as the backbone. DualStyleGAN adds an extrinsic style path to StyleGAN and is conditioned on an intrinsic style code, an extrinsic style code, and the style degree. The intrinsic style code describes the characteristics of the face, while the extrinsic style code describes the structure and color style of an external artistic portrait. The structure style degree and the color style degree determine how strongly each style is applied.

The exemplar-based framework has much in common with the collection-based framework described above. It achieves flexible style control through two modifications: a modified ModRes for structure style control, and an added style-degree-aware fusion module. The complete architecture is shown in Figure 9.
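How two separate style degrees can control the strength of the applied style is shown below in a simplified form. This is not how DualStyleGAN or VToonify implement it, since the real models apply the degrees inside network layers. The sketch assumes a plain additive blend on a feature vector, and all names and values are hypothetical.

```python
def apply_style_degree(content, style_residual, degree):
    """Add a style adjustment to content features, scaled by a degree in [0, 1]."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0 and 1")
    return [c + degree * r for c, r in zip(content, style_residual)]


def stylize(features, structure_residual, color_residual,
            structure_degree, color_degree):
    """Apply structure and color adjustments with independent strengths."""
    structured = apply_style_degree(features, structure_residual, structure_degree)
    return apply_style_degree(structured, color_residual, color_degree)


features = [1.0, 2.0, 3.0]             # hypothetical content features
structure_residual = [0.5, -0.5, 1.0]  # hypothetical structure adjustment
color_residual = [2.0, 2.0, -2.0]      # hypothetical color adjustment

unchanged = stylize(features, structure_residual, color_residual, 0.0, 0.0)
half_structure = stylize(features, structure_residual, color_residual, 0.5, 0.0)
full_style = stylize(features, structure_residual, color_residual, 1.0, 1.0)
```

With both degrees at 0 the features pass through untouched, and raising either degree toward 1 moves the result toward the exemplar's structure or color independently of the other.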


Experimental results

Experimental results show that the stylized frames generated by VToonify are not only as high in quality as those of the backbones but also better preserve the details of the input frames.


For more details, please refer to the original paper.
