Almost Human column
Author: High Rainbow
Are you ready for a cyberpunk world where humans and digital people coexist?
As the backbone of many future applications in the virtual world, creating lifelike virtual digital humans has long been an important research topic in artificial intelligence-related disciplines such as computer vision, computer graphics, and multimedia.
Recently, AD-NeRF, a technology jointly developed by Lu Shenshi Technology Co., Ltd., Zhejiang University, and Tsinghua University in collaboration with the University of Science and Technology of China, has drawn attention from both academia and industry.
Building on the recently popular neural radiance field (NeRF: Neural Radiance Fields) technique, researchers from Zhang Juyong's group at the University of Science and Technology of China and other institutions proposed an algorithm that generates talking-head video directly from the speech signal. With only a few minutes of video of the target person speaking, the method can reproduce a photorealistic image of that person and drive it with arbitrary speech.
Paper address: https://arxiv.org/pdf/2103.11078.pdf
Project address: https://yudongguo.github.io/ADNeRF/
"Make the virtual person construction at your fingertips And"
As artificial intelligence technology is moving towards a steady landing, the transformation and exploration of the practical application of new technologies in society has become a consensus reached by the academic and industrial circles. In this process, "digital virtual person" is undoubtedly a very "eye-catching" concept from the mainstream view. According to the final appearance of the target character,Digital virtual humans can be divided into 2D and 3D types, or animation, anthropomorphic, and real characters. In the Spring Festival Gala of 2021, the virtual idol Luo Tianyi was presented for the first time on the TV show at the time of family reunion. During the two sessions in March, the digital virtual reporter "Little C" created by CCTV.com, with a vivid role image, assumed the task of connecting with the representatives of the National People's Congress in real time and broadcasting policy news.
From top to bottom: Samsung's virtual digital human Neon, virtual idol Luo Tianyi, and the movie character Alita.
According to the "2019 Virtual Idol Observation Report" released earlier by iQiyi, at least 390 million people in China follow virtual idols, and at least tens of thousands of digital virtual human streamers are active on major short-video platforms such as Douyin, Kuaishou, and Bilibili. Beyond pan-entertainment, digital virtual humans also open up wide room for imagination in a range of other social applications: virtual doctors, virtual teachers, virtual customer service, virtual shopping guides, and so on.
The digital virtual human is an important medium for human-computer interaction, and how to efficiently construct virtual humans with realistic appearance and natural expressions and movements has long been a hot research topic in this field. With traditional computer graphics and animation techniques, building vivid and realistic virtual human dynamics (such as mouth shapes and expressions that match the spoken content) requires specialized and complex manual work, which greatly limits the wide application of digital virtual humans. In recent years, virtual human construction based on deep learning has made good progress. However, existing learning-based methods, whether image-based generative adversarial network (GAN) approaches or face editing and rendering approaches based on 3D face reconstruction models, suffer from problems such as requiring large amounts of training data and producing low-quality results. Take the Synthesizing Obama work proposed by Suwajanakorn et al. in 2017 as an example: to drive the single character of Obama with speech, the method uses up to 14 hours of Obama's own video as training data to ensure good final image and video quality. Meanwhile, much GAN-based speech-driven face work is limited by the training complexity of the GAN model itself and can usually only output video at resolutions no higher than 256x256.
GAN-based methods generate images at low resolution, while AD-NeRF, based on neural radiance field rendering, supports rendering at arbitrary resolution.
With the AD-NeRF method, only three to five minutes of video of the target person speaking is needed to drive that person with arbitrary speech. Moreover, the results have high-definition image quality and natural facial expressions, far surpassing previous methods. This "cheap yet high-quality" approach, which requires only a small amount of training data to generate high-quality results, undoubtedly provides a powerful and convenient tool for creating virtual human figures.
How does the face magic work?
The figure below shows the algorithm pipeline of AD-NeRF:
(1) Cross-modal mapping from speech to a dynamic neural radiance field: To capture the high-quality detail and dynamics of the speaker's face, torso, and background, the authors combine DeepSpeech speech features with the recent neural radiance field (NeRF) method. That is, they model an implicit function F whose inputs include the assumed camera position, the viewing direction, and the corresponding speech feature, and which outputs the color and density of continuous points along each ray. Integrating along the ray yields the final color of the pixel that the ray points to.
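The core idea can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: an MLP takes a 3D sample point, a viewing direction, and a DeepSpeech-style audio feature, predicts color and density, and the results are integrated along a camera ray with the standard NeRF volume-rendering quadrature. The network sizes and the 29-dimensional audio feature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioNeRF(nn.Module):
    """Toy audio-conditioned radiance field: (point, view dir, audio) -> (color, density)."""
    def __init__(self, audio_dim=29, hidden=256):
        super().__init__()
        # Input: 3D position (3) + view direction (3) + audio feature (audio_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3) + density sigma (1)
        )

    def forward(self, x, d, a):
        out = self.mlp(torch.cat([x, d, a], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3])     # non-negative density
        return rgb, sigma

def render_ray(model, origin, direction, audio_feat, near=0.1, far=1.0, n_samples=64):
    """Integrate color along one camera ray (standard NeRF volume rendering)."""
    t = torch.linspace(near, far, n_samples)              # sample depths along the ray
    pts = origin + t[:, None] * direction                 # 3D sample points
    dirs = direction.expand(n_samples, 3)                 # same view direction per sample
    audio = audio_feat.expand(n_samples, -1)              # same audio feature per sample
    rgb, sigma = model(pts, dirs, audio)
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])  # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)                    # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                    # transmittance-weighted opacity
    return (weights[:, None] * rgb).sum(dim=0)                 # final pixel color

# Example: render one pixel for one audio frame (random inputs, just to show shapes).
model = AudioNeRF()
color = render_ray(model, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]), torch.randn(29))
print(color.shape)  # torch.Size([3])
```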
(2) Complete and stable synthesis of head and torso: Since the motion of the face and the torso are not fully consistent while speaking, the authors split the original neural radiance field into two implicit models with separate responsibilities. They first perform semantic segmentation on each frame of the training data. For the head part, 3D motion parameters are estimated with multi-frame continuous optical flow and converted directly into hypothetical camera extrinsics used to train the head neural radiance field. The torso module then uses the head motion parameters as additional conditioning information to control the modeling of the body part. The clear benefit of this design is that it eliminates the jitter caused by inconsistent head and body poses:
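To make this division of labor concrete, here is a hypothetical sketch of how the two branches could be composited per frame. The render_head / render_torso interfaces and all names are illustrative assumptions, not the project's actual API: the head branch is rendered under the per-frame head pose treated as camera extrinsics, while the torso branch keeps a fixed camera and instead receives the head pose as an extra conditioning vector.

```python
import torch

def render_frame(render_head, render_torso, head_pose, audio_feat, rays, background):
    """Composite one frame from the head and torso branches (all interfaces illustrative)."""
    # Head branch: the per-frame head pose plays the role of the camera extrinsics.
    head_rgb, head_alpha = render_head(rays, extrinsics=head_pose, audio=audio_feat)

    # Torso branch: fixed canonical camera; the head pose is flattened into a
    # conditioning vector so the torso can follow the head motion without jitter.
    torso_rgb, torso_alpha = render_torso(rays, audio=audio_feat,
                                          condition=head_pose.reshape(-1))

    # Front-to-back alpha compositing: head over torso over the (replaceable) background.
    torso_layer = torso_rgb * torso_alpha + (1.0 - torso_alpha) * background
    return head_rgb * head_alpha + (1.0 - head_alpha) * torso_layer

# Tiny runnable demo with dummy branch renderers that return random color / alpha maps.
H, W = 4, 4
dummy = lambda rays, **kw: (torch.rand(H * W, 3), torch.rand(H * W, 1))
rays = torch.rand(H * W, 6)  # per-pixel ray origin + direction
frame = render_frame(dummy, dummy, torch.eye(4), torch.randn(29), rays, torch.rand(H * W, 3))
print(frame.shape)           # torch.Size([16, 3])
```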
(3) Support for background and viewpoint editing: Because the neural radiance field implicitly captures 3D information, the authors further explored downstream applications such as arbitrarily replacing the background and changing the viewing angle. To realize these applications, one only needs to change the assumed camera extrinsics and the background image while feeding in the test audio. Examples of these applications can be seen in the figure below:
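As a rough illustration of how such test-time editing could be wired up (an assumed interface, not the project's actual API): novel-view rendering amounts to swapping in a different extrinsic matrix, and background replacement amounts to compositing the accumulated alpha over a new image.

```python
import math
import torch

def yaw_extrinsics(angle_deg):
    """Build a 4x4 camera extrinsic matrix rotated around the y-axis by angle_deg."""
    a = math.radians(angle_deg)
    return torch.tensor([
        [ math.cos(a), 0.0, math.sin(a), 0.0],
        [ 0.0,         1.0, 0.0,         0.0],
        [-math.sin(a), 0.0, math.cos(a), 0.0],
        [ 0.0,         0.0, 0.0,         1.0],
    ])

def edited_render(render_fn, audio_feat, new_background, view_angle_deg=15.0):
    """Render the trained speaker model from an edited viewpoint over a new background."""
    extrinsics = yaw_extrinsics(view_angle_deg)       # edited observation angle
    rgb, alpha = render_fn(extrinsics=extrinsics, audio=audio_feat)
    # Pixels the radiance field leaves transparent are filled by the new background.
    return rgb * alpha + (1.0 - alpha) * new_background

# Demo with a dummy renderer standing in for the trained radiance field.
H, W = 4, 4
dummy = lambda **kw: (torch.rand(H * W, 3), torch.rand(H * W, 1))
edited = edited_render(dummy, torch.randn(29), torch.rand(H * W, 3), view_angle_deg=20.0)
print(edited.shape)  # torch.Size([16, 3])
```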
What possibilities does AD-NeRF bring?
Not long ago, the digital human was still a cyberpunk subject beloved by science fiction and movies; now, as the technology for creating digital virtual humans iterates, this futuristic concept is entering ordinary people's homes at an unprecedented speed. So which practical virtual human applications might AD-NeRF make technically possible?
The first is video conferencing. As shown above, AD-NeRF can easily support speech-driven animation of any character. For video conferencing, which demands high bandwidth, it may no longer be necessary to transmit encoded video in real time; the audio signal alone can drive the speaker's own virtual image. Moreover, the background replacement and pose editing supported by AD-NeRF, combined with AR headsets and other devices, can make you feel as if you were conversing on site in an arbitrarily constructed three-dimensional scene.
Second, AD-NeRF needs only a few minutes of video to train the dynamic radiance field of a specific person. If you want to preserve a digital likeness of a close relative or friend and always be able to talk with them face to face, the algorithm design of AD-NeRF greatly reduces the difficulty of creating such a digital image; eternal life in cyberspace may no longer be just a dream.
Finally, AD-NeRF undoubtedly has great potential to improve the current commercial pipeline for building digital virtual humans. Whether creating a lifelike virtual anchor, a friendly virtual shopping guide, or a serious virtual teacher, AD-NeRF puts it "at your fingertips": all it takes is an expressive actor recording a voiced video, and the rest can be handed over to automated speech-driven technology. Its prospects for commercial innovation are very broad.
On the other hand, while powerful technology is empowering, ever lower thresholds and data requirements also expose the creation of digital virtual humans to many risks and controversies: for example, forged digital likenesses used to steal other people's property, fake videos used to spread false news, or even content deliberately produced to demean and insult others. Last year, a series of AI applications such as DeepFake and the Zao "AI face-swapping" app triggered wide-ranging discussion across society about ethics and privacy. Correspondingly, a line of "face-swap detection" research on topics such as DeepForensics has emerged in academia.
At the application level, AD-NeRF uses a more advanced underlying algorithm: it implicitly models 3D motion details through a neural radiance field and renders complete, realistic frames, which also poses a more valuable challenge for discriminating and detecting real versus fake face videos.
As the saying goes, "however high the demon rises, virtue rises higher." To meet the needs of security and privacy protection, stronger anti-forgery and detection algorithms will inevitably grow alongside virtual human technology in the future, like twin stars competing and developing together. From the perspective of fairness and justice, the virtual human, as a product of the digital age, also needs to be brought under the constraints of laws, regulations, and industry norms. I believe that in the future, virtual digital humans will become synonymous with intelligence, convenience, and reliability, and will do even more to improve how information is exchanged and how people interact in this world.