
How to make AI speech more expressive of human emotion?

Recently, INTERSPEECH 2022, the world's top conference in the speech field, announced its list of accepted papers. An emotional speech synthesis paper co-authored by Mobvoi and Professor Xie Lei's ASLP Laboratory at Northwestern Polytechnical University (NPU) was accepted and will be presented at the conference.

INTERSPEECH, the flagship international conference of the International Speech Communication Association (ISCA), enjoys a worldwide reputation and broad academic influence, and is the world's largest comprehensive technical conference in the speech field. Its acceptance criteria are strict, and every edition draws wide attention from speech researchers around the world. The acceptance of this paper signals that Mobvoi's research strength and capacity for technological innovation in speech synthesis have been recognized by the international academic community.


Paper contribution: a path to cross-speaker emotion transfer in speech synthesis

How can AI speech be made richer and more expressive in human emotion? Mobvoi addresses this in a paper titled "Cross-speaker Emotion Transfer Based on Prosody Compensation for End-to-End Speech Synthesis."

Cross-speaker emotion transfer speech synthesis transfers emotion from a source speaker, for whom emotional data exists, to a target speaker who has none, so that the target speaker can express emotions absent from their own training data. Emotion transfer is the most popular strategy in this cross-speaker scenario, and it is crucial to extract a speaker-independent emotion embedding from the source speaker's emotional reference audio: otherwise, residual speaker information in the emotion embedding will contaminate the target speaker's timbre. However, removing the source speaker's timbre tends to weaken the emotional information the embedding carries, so the synthesized emotional speech of the target speaker sounds flat.
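A widely used way to obtain such a speaker-independent emotion embedding is adversarial speaker classification through a gradient-reversal layer. The PyTorch sketch below illustrates that general technique only; the GRU encoder, the two classification heads, and all layer sizes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so training a speaker classifier through it pushes speaker
    information OUT of the embedding."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out

class EmotionReferenceEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into one emotion embedding and
    trains it to predict emotion but, adversarially, not speaker."""
    def __init__(self, n_mels=80, emb_dim=128, n_emotions=5, n_speakers=10):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.emotion_head = nn.Linear(emb_dim, n_emotions)  # keeps emotion in
        self.speaker_head = nn.Linear(emb_dim, n_speakers)  # drives speaker out

    def forward(self, ref_mel):                    # ref_mel: (B, T, n_mels)
        _, h = self.rnn(ref_mel)                   # h: (1, B, emb_dim)
        emo_emb = h.squeeze(0)                     # (B, emb_dim)
        emo_logits = self.emotion_head(emo_emb)
        spk_logits = self.speaker_head(GradReverse.apply(emo_emb))
        return emo_emb, emo_logits, spk_logits

# Minimizing cross-entropy on BOTH heads makes the encoder useful for emotion
# yet, through the reversed gradient, useless for speaker identification;
# this scrubbing is also why the embedding's emotional detail gets weakened.
enc = EmotionReferenceEncoder()
emo_emb, emo_logits, spk_logits = enc(torch.randn(4, 200, 80))
```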

How to keep the emotional information in the emotion embedding from being weakened is therefore the challenge. Specifically, reference embeddings that retain enough emotional information tend to leak the source speaker's timbre into the synthesized speech, while scrubbing speaker information further out of the reference embedding weakens the transferred emotional expression. To address this trade-off, Mobvoi's paper proposes a prosody compensation strategy that restores the emotional information lost when speaker information is removed from the emotion embedding, improving the emotional expressiveness of the synthesized speech.


The paper observes that the hidden representations produced by a pre-trained automatic speech recognition (ASR) model retain some prosodic information while carrying little obvious speaker information. This motivates a prosody compensation module (PCM) that takes the ASR model's intermediate features (AIF) of the reference audio as input to compensate for the lost emotional information. The proposed cross-speaker emotional speech synthesis model with prosody compensation comprises a speaker disentangling module (SDM), a speaker embedding module, and the PCM. The SDM extracts a speaker-independent emotion embedding from the reference spectrogram, while the PCM draws additional emotional information from the AIF to compensate for what disentangling the speaker's timbre removes. To extract global prosody information from the AIF effectively, a prosody compensation encoder assisted by a global context (GC) block (shown in Figure 2 of the paper) is also introduced. Experiments show that this method effectively alleviates the loss of emotional expressiveness in the disentangled emotion embedding, preserving the target speaker's timbre while improving the expressiveness of the transferred emotion.
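As a rough illustration of how the PCM and the GC-assisted prosody compensation encoder might fit together, here is a minimal PyTorch sketch. The additive fusion, the mean pooling, and all dimensions are assumptions made for readability; the paper's actual layer choices differ in detail.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GC block: softmax-attention pooling over time yields one global
    descriptor, which is transformed and added back to every frame."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.transform = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                           # x: (B, T, dim)
        w = torch.softmax(self.attn(x), dim=1)      # (B, T, 1), over time
        ctx = (w * x).sum(dim=1, keepdim=True)      # (B, 1, dim)
        return x + self.transform(ctx)              # broadcast over T

class ProsodyCompensationModule(nn.Module):
    """PCM: distills a global prosody vector from the ASR intermediate
    features (AIF) of the reference audio and adds it back to the
    (weakened) speaker-independent emotion embedding."""
    def __init__(self, aif_dim=256, emb_dim=128):
        super().__init__()
        self.proj = nn.Linear(aif_dim, emb_dim)
        self.gc = GlobalContextBlock(emb_dim)       # prosody compensation encoder

    def forward(self, aif, emotion_emb):            # aif: (B, T, aif_dim)
        h = self.gc(self.proj(aif))                 # (B, T, emb_dim)
        prosody = h.mean(dim=1)                     # global prosody vector
        return emotion_emb + prosody                # compensated embedding

pcm = ProsodyCompensationModule()
out = pcm(torch.randn(2, 150, 256), torch.randn(2, 128))  # -> (2, 128)
```

The compensated embedding would then condition the TTS decoder together with the target speaker's embedding, so that the timbre comes from the target speaker while the emotion and prosody come from the reference.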

(Speech synthesis example)

Industry application: "Magic Sound Workshop", the industry's leading AI dubbing tool

In recent years, Mobvoi's accumulated voice technology has matured, and it has gradually polished an AI dubbing product for consumers: "Magic Sound Workshop". Built on MeetVoice, Mobvoi's self-developed speech synthesis system, the product offers accurate pronunciation and smooth prosody, and has become a go-to dubbing tool for short-video creators.

Magic Sound Workshop offers rich dubbing editing features. In a Word-like editor interface, it handles pause adjustment, polyphonic-character disambiguation, multiple speakers, local speed changes, and other all-round edits, and it adds industry-first tuning features such as stress emphasis and drag-to-edit audio, bringing AI dubbing closer to a real human voice.

But how can Magic Sound Workshop use its massive data to combine speakers of different styles with different emotions, so that it offers more speakers with rich emotions and diverse styles? How can speakers' emotions be made more vivid and abundant? These questions sit behind the ultimate product experience that Magic Sound Workshop has always pursued.

Today's speech synthesis systems depend heavily on high-quality corpora whose style and emotion match the target voice. Style/emotion transfer makes the effect of "one person, a thousand voices" achievable; putting this technology into practice will greatly improve the efficiency of building stylized, emotional speech synthesis systems and reduce their construction cost.

To achieve "one person, a thousand voices", Magic Sound Workshop has also developed and shipped "voice conversion", which transfers speaker A's speaking style (rhythm, prosody, and so on) to speaker B: the converted voice has B's timbre and A's rhythm and prosody.
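Conceptually, voice conversion rests on decomposing speech into content, prosody, and timbre, then recombining them. The sketch below shows that decomposition with every encoder stubbed as a tiny network; a production system, presumably including Mobvoi's, would use pretrained content and speaker encoders plus a neural vocoder, none of which appears here.

```python
import torch
import torch.nn as nn

class VoiceConversionSketch(nn.Module):
    """Recombine A's content and prosody with B's timbre (all stubs)."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, dim, batch_first=True)  # WHAT A says
        self.prosody_enc = nn.GRU(n_mels, dim, batch_first=True)  # HOW A says it
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)  # B's timbre
        self.decoder = nn.Linear(3 * dim, n_mels)                 # toy decoder

    def forward(self, mel_a, mel_b):               # each: (B, T, n_mels)
        content, _ = self.content_enc(mel_a)       # frame-level, from A
        prosody, _ = self.prosody_enc(mel_a)       # frame-level, from A
        _, spk = self.speaker_enc(mel_b)           # single vector for B
        spk = spk.squeeze(0).unsqueeze(1).expand_as(content)
        return self.decoder(torch.cat([content, prosody, spk], dim=-1))

vc = VoiceConversionSketch()
converted = vc(torch.randn(1, 300, 80), torch.randn(1, 240, 80))  # B's timbre, A's delivery
```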


(Magic Sound Workshop product interface)

The "sound conversion" of "Moyin Workshop" can realize:

1. When the AI synthesis is flawed, with broken sounds or unclear, thin pronunciation, use this feature to let your AI anchor learn the delivery of another AI anchor, or learn your own reading;

2. When a certain spot needs emphasis but the AI downplays it, try the voice conversion feature so the delivery "knows what matters";

3. When you want to draw out a sound somewhere but the AI reads it short and fast, use voice conversion to get the pacing right;

4. When the AI rendering of a key line falls short (for example, in the golden first 10 seconds of a video, where creators want the dubbing to stand out), try voice conversion: record the line yourself and let a Magic Sound Workshop AI voice carry your expressive performance, making the result more vivid and emotional.

This paper is part of our ongoing exploration. We look forward to Magic Sound Workshop bringing more diverse speakers online, letting everyone become a director of sound and helping the AI dubbing industry flourish.

Going forward, Mobvoi will continue to deepen its R&D in voice and acoustics and gradually bring the results to more products and services, using smarter technology to create a more considerate voice experience that is full of emotion and pronounces words "on demand", making human-machine interaction more natural and bringing AI into more people's daily lives.

Paper: "Cross-speaker Emotion Transfer Based on Prosody Compensation for End-to-End Speech Synthesis"

Authors: Li Tao, Wang Xinsheng, Xie Qicong, Wang Zhichao, Jiang Mingqi, Xie Lei
