Voice-driven 3D virtual humans: an interpretation of Baidu's latest ACCV 2020 paper


Machine Heart report

Machine Heart editorial department

This article covers the paper "Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses" from Baidu Research, accepted at the Asian Conference on Computer Vision (ACCV) 2020.

Speech2Video is the task of synthesizing video of human body motion (head, mouth, arms, etc.) from speech audio input. The resulting video should look visually natural and be consistent with the given speech. Traditional Speech2Video approaches generally rely on special equipment and professional operators for performance capture, with most of the synthesis and rendering work done by animators, so customization is usually expensive.

In recent years, with the successful application of deep neural networks, data-driven methods have become practical. For example, Synthesizing Obama [1] and MouthEditing synthesize a talking mouth by using an RNN to drive mouth movement from speech. Taylor et al. [3] proposed using audio to drive a high-fidelity graphics model that animates not only the mouth but also other parts of the face, yielding richer speech expression.

However, mouth motion synthesis is largely deterministic: given a pronunciation, the movement and shape of the mouth are similar across different people and environments. In real life, full-body gestures in the same situation exhibit much greater variability. These gestures depend heavily on the current context and on the person who is speaking, and personalized gestures appear at specific moments when important information is delivered. Useful information therefore exists only sparsely in video, which makes it difficult for simple end-to-end learning algorithms [1, 3] to capture this diversity from a limited amount of recorded video.

Recently, Baidu proposed a new method to convert given text or audio into a realistic video with synchronized, realistic and expressive body language. The method first uses a recurrent neural network (RNN) to generate 3D skeletal motion from the audio sequence, and then synthesizes the output video with a conditional generative adversarial network (GAN).


Paper address: https://arxiv.org/pdf/2007.09198.pdf

To make the skeletal motion realistic and expressive, the researchers embed knowledge of articulated 3D human skeletons and a learned dictionary of personalized speech gestures into both the training and testing processes. The former prevents implausible body deformation, while the latter helps the model learn quickly from a small number of videos containing meaningful body motion. To produce realistic high-resolution videos with rich motion detail, the researchers propose a conditional GAN in which key details such as the head and hands are automatically enlarged and given their own discriminators. The method outperforms previous SOTA approaches on similar tasks.
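As a rough illustration of the per-region discriminator idea described above, the sketch below crops the head and hand regions of a generated frame and scores each crop with its own small PatchGAN-style discriminator alongside a full-frame discriminator. This is not the authors' code; the network sizes, crop boxes, and module names are assumptions made for illustration.

```python
# Hypothetical sketch of per-region discriminators (full frame, head, hand), as
# described above. Architectures, crop boxes and names are assumptions, not the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """A small PatchGAN-style discriminator that scores RGB crops patch-wise."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 4, stride=1, padding=1),  # per-patch real/fake map
        )

    def forward(self, x):
        return self.net(x)

def crop_and_resize(frame, box, size=128):
    """Crop a (B, 3, H, W) frame to a bounding box and resize to a fixed size."""
    x0, y0, x1, y1 = box
    region = frame[:, :, y0:y1, x0:x1]
    return F.interpolate(region, size=(size, size), mode='bilinear', align_corners=False)

# One discriminator for the whole frame and one each for the enlarged head/hand crops.
d_full, d_head, d_hand = PatchDiscriminator(), PatchDiscriminator(), PatchDiscriminator()

frame = torch.randn(1, 3, 512, 512)                              # dummy generated frame
head_box, hand_box = (200, 20, 320, 140), (60, 300, 180, 420)    # assumed crop boxes

scores = {
    'full': d_full(F.interpolate(frame, size=(256, 256), mode='bilinear', align_corners=False)),
    'head': d_head(crop_and_resize(frame, head_box)),
    'hand': d_hand(crop_and_resize(frame, hand_box)),
}
```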


Method


Figure 1: Speech2Video system pipeline

As shown in Figure 1, the input to the system is either audio or text, depending on which was used to train the LSTM network. Since both text-to-speech (TTS) and speech-to-text (STT) technologies are mature and commercially available, audio and text can be treated as interchangeable. Even when the most advanced STT engines misrecognize some words, the system can tolerate these errors: the main purpose of the LSTM network is to map text/audio to body poses, and incorrect STT output usually consists of words that sound similar to the true ones, so their spelling is also likely to be similar, and they end up being mapped to more or less similar body poses.
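A minimal sketch of this text/audio-to-pose mapping stage, assuming the audio has already been converted into per-frame features (e.g., MFCC-like vectors) and each pose is a fixed-length parameter vector (106 values, matching the SMPL-X parameterization mentioned below). Layer sizes and feature dimensions are illustrative, not taken from the paper.

```python
# Minimal sketch: an LSTM that maps per-frame audio features to per-frame body
# pose parameter vectors. Dimensions are illustrative; 106 matches the SMPL-X
# parameter count mentioned later in the article.
import torch
import torch.nn as nn

class AudioToPoseLSTM(nn.Module):
    def __init__(self, audio_dim=26, hidden_dim=256, pose_dim=106):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, pose_dim)    # regress pose parameters

    def forward(self, audio_feats):                    # (B, T, audio_dim)
        h, _ = self.lstm(audio_feats)
        return self.head(h)                            # (B, T, pose_dim)

model = AudioToPoseLSTM()
audio_feats = torch.randn(2, 200, 26)   # 2 clips, 200 frames of MFCC-like features
poses = model(audio_feats)              # (2, 200, 106) pose parameters per frame
```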

The output of the LSTM is a sequence of human poses parameterized by SMPL-X [9], a jointly articulated 3D model of the human body, face and hands. This dynamic 3D model is visualized as a sequence of 2D color skeleton images, which are then fed into the vid2vid generation network [17] to produce the final photorealistic frames.
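The intermediate representation handed to vid2vid is a 2D color skeleton image. Below is a rough sketch of how such a conditioning image might be rasterized from projected 2D joint positions with OpenCV; the toy skeleton topology, colors, and joint coordinates are assumptions, not the paper's exact drawing convention.

```python
# Sketch: rasterize projected 2D joints into a color skeleton image, i.e. the
# kind of conditioning frame a vid2vid-style generator consumes. The toy
# 5-joint topology and colors are illustrative only.
import numpy as np
import cv2

# (joint_a, joint_b, BGR color) pairs for a toy upper-body skeleton
BONES = [(0, 1, (0, 255, 0)),    # neck -> head
         (0, 2, (255, 0, 0)),    # neck -> right shoulder
         (2, 3, (0, 0, 255)),    # right shoulder -> right elbow
         (3, 4, (0, 255, 255))]  # right elbow -> right wrist

def render_skeleton(joints_2d, height=512, width=512):
    """joints_2d: (J, 2) array of pixel coordinates for one frame."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b, color in BONES:
        pa = tuple(int(v) for v in joints_2d[a])
        pb = tuple(int(v) for v in joints_2d[b])
        cv2.line(canvas, pa, pb, color, thickness=4)
    for x, y in joints_2d:
        cv2.circle(canvas, (int(x), int(y)), 5, (255, 255, 255), -1)
    return canvas

joints = np.array([[256, 150], [256, 90], [310, 160], [340, 240], [360, 320]], float)
skeleton_img = render_skeleton(joints)   # one conditioning frame for the generator
```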

While the LSTM successfully synchronizes speech and motion, most of the time it only learns repetitive human motions, which makes the video look boring. To make the motions more expressive and varied, the researchers add specific poses to the LSTM output whenever certain keywords appear, for example huge, tiny, high, low, etc. They built a dictionary that maps these keywords to their corresponding poses.
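A toy sketch of the keyword-to-pose dictionary: each entry stores a short clip of pose parameter vectors recorded for that word, and a lookup over the time-aligned transcript returns the clips (with their frame indices) to be inserted. The keywords, clip lengths, and alignment format here are assumptions.

```python
# Toy sketch of a keyword -> key-pose dictionary. Each entry stores a short clip
# of pose parameter vectors (frames x 106 SMPL-X parameters); dummy data here.
import numpy as np

pose_dictionary = {
    "huge": np.random.randn(15, 106),   # 15-frame gesture clip (placeholder values)
    "tiny": np.random.randn(12, 106),
    "you":  np.random.randn(1, 106),    # a static single-frame pose
}

def lookup_key_poses(aligned_words):
    """aligned_words: list of (word, frame_index) pairs from the time-aligned
    transcript. Returns (frame_index, pose_clip) pairs for dictionary hits."""
    hits = []
    for word, frame_idx in aligned_words:
        clip = pose_dictionary.get(word.lower())
        if clip is not None:
            hits.append((frame_idx, clip))
    return hits

# Example: only "huge" has an entry, so one clip is returned for frame 14.
hits = lookup_key_poses([("a", 10), ("huge", 14), ("mountain", 22)])
```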


Figure 3 shows the data collection setup. The model stands in front of the camera and a screen, and videos are captured while he or she reads the script shown on the screen. Finally, the model is asked to pose for certain keywords, such as huge, tiny, up, down, me, you, etc.

Human model fitting

The researchers first used 2D keypoints as the representation of the human model and trained the LSTM network on them, but the results were not satisfactory (as shown in Figure 4).


They finally adopted SMPL-X, an articulated 3D human body model. SMPL-X models human body dynamics with a kinematic skeleton of 54 joints, including the neck, fingers, arms, legs, and feet.
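For readers who want to experiment with the body model itself, the publicly available smplx Python package can load an SMPL-X model and pose it. The sketch below is a hedged example, not the paper's fitting pipeline; the model directory is a placeholder that must point to the downloaded SMPL-X model files.

```python
# Hedged sketch using the open-source `smplx` package (pip install smplx); the
# model directory is a placeholder and must contain the downloaded SMPL-X files.
import torch
import smplx

model = smplx.create("models/", model_type="smplx", gender="neutral")  # path is an assumption

betas = torch.zeros(1, 10)            # body shape coefficients
body_pose = torch.zeros(1, 21 * 3)    # axis-angle rotations for the body joints
global_orient = torch.zeros(1, 3)     # root orientation

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient,
               return_verts=True)
joints = output.joints                # 3D joint locations of the articulated skeleton
vertices = output.vertices            # posed mesh vertices
```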

Dictionary construction and key pose insertion


As shown in Figure 5, the researchers manually select key poses from the recorded video and build a word-to-pose query dictionary. Each pose is again expressed as 106 SMPL-X parameters. A key pose can be a static single-frame pose or a multi-frame motion, and both can be inserted into the existing skeleton video in the same way.
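One simple way to splice a dictionary pose (single- or multi-frame) into the LSTM pose sequence is to overwrite the frames at the keyword's timestamp and linearly blend a few frames on either side so the transition stays smooth. The sketch below illustrates that idea; it is not the paper's exact insertion scheme, and the blend length and dummy data are arbitrary.

```python
# Sketch: insert a key-pose clip into a pose-parameter sequence at a given frame,
# with a short linear blend on both sides to avoid visible jumps.
import numpy as np

def insert_key_pose(sequence, clip, start, blend=5):
    """sequence: (T, D) pose parameters; clip: (K, D) key pose(s); start: frame index."""
    seq = sequence.copy()
    end = min(start + len(clip), len(seq))
    seq[start:end] = clip[: end - start]

    # Blend from the original motion into the first frame of the clip.
    for i in range(1, blend + 1):
        t = start - i
        if t < 0:
            break
        w = i / (blend + 1)                       # weight of the original pose
        seq[t] = w * sequence[t] + (1 - w) * clip[0]

    # Blend from the last inserted frame back to the original motion.
    last = clip[end - start - 1]
    for i in range(1, blend + 1):
        t = end - 1 + i
        if t >= len(seq):
            break
        w = i / (blend + 1)
        seq[t] = (1 - w) * last + w * sequence[t]
    return seq

poses = np.zeros((200, 106))           # LSTM output (dummy)
key_clip = np.ones((15, 106))          # e.g. a "huge" gesture clip (dummy)
smoothed = insert_key_pose(poses, key_clip, start=80)
```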

Training the video generation network

The researchers use the generation network proposed in vid2vid to convert skeleton images into realistic portraits.


Figure 7: Example image pair used to train vid2vid. Both hands are marked with circles in a special color.

In terms of running time and hardware, the most time- and memory-consuming stage of the system is training the vid2vid network: it takes about a week to run 20 training epochs on a cluster of 8 NVIDIA Tesla M40 24GB GPUs. Testing is much faster, taking only about 0.5 seconds per generated frame on a single GPU.

Evaluation and analysis

As shown in Table 1, the researchers compared their method against four SOTA methods in a user study. The results show that the proposed method obtained the best overall quality score.


In addition, the researchers used the Inception Score to evaluate the generated images along two aspects: image quality and image diversity.
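For reference, the Inception Score is computed from the class-probability outputs of an Inception classifier as IS = exp(E_x[KL(p(y|x) || p(y))]), which rewards both confident per-image predictions (quality) and a spread-out marginal distribution (diversity). A small numpy sketch on dummy probabilities:

```python
# Sketch: Inception Score from an (N, C) matrix of per-image class probabilities
# p(y|x), e.g. softmax outputs of an Inception classifier on generated frames.
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                               # exp of mean KL divergence

probs = np.random.dirichlet(np.ones(1000), size=500)              # 500 dummy predictions
score = inception_score(probs)                                    # higher = better quality and diversity
```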


To evaluate the final output videos, the researchers ran a subjective study on Amazon Mechanical Turk (AMT) with 112 participants. Each participant was shown five videos: four synthesized videos (two driven by real human audio and two by TTS audio) and one short video of a real person. Participants rated the videos on a Likert scale from 1 (strongly disagree) to 5 (strongly agree) along five dimensions: 1) the completeness of the human body (no missing body parts or fingers); 2) the clarity of the face in the video; 3) whether the human motions (arms, hands, body gestures) look natural and smooth; 4) whether body movements and gestures are synchronized with the audio; 5) the overall visual quality of the video.

Summary

Speech2Video is a novel framework that uses a 3D-driven approach to generate realistic speech videos while avoiding the construction of 3D mesh models. Within the framework, the authors built a dictionary of personalized key gestures to deal with data sparsity and diversity. More importantly, they apply 3D skeleton constraints when generating body dynamics, ensuring that the produced poses are physically plausible.

References:

1. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36 (2017) 95

2. Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A., Shechtman, E., Goldman, D.B., Genova, K., Jin, Z., Theobalt, C., Agrawala, M.: Text-based editing of talking-head video. arXiv preprint arXiv:1906.01524 (2019)

3. Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A.G., Hodgins, J., Matthews, I.: A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36 (2017) 93

4. Kim, B.H., Ganapathi, V.: LumièreNet: Lecture video synthesis from audio. arXiv preprint arXiv:1907.02253 (2019)

5. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 7753–7762

6. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018)

7. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019)

8. Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B.: Video-to-video synthesis. In: Advances in Neural Information Processing Systems (NeurIPS). (2018)

9. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (TOG) 36 (2017) 245
