
How to make AI speech more expressive of human emotion?

Recently, INTERSPEECH 2022, the world's top conference in the speech field, announced its list of accepted papers. An emotional speech synthesis paper co-authored by Mobvoi and Professor Xie Lei's ASLP Laboratory at Northwestern Polytechnical University (NPU) was accepted and will be presented at the conference.

INTERSPEECH, the flagship international conference of the International Speech Communication Association (ISCA), enjoys a worldwide reputation and broad academic influence, and is the world's largest comprehensive technical conference in the speech field. Its acceptance criteria are strict, and every edition draws wide attention from speech researchers around the world. The acceptance of this paper signals that Mobvoi's research strength and capacity for technological innovation in speech synthesis have been recognized by the international academic community.


Paper contribution: a path to cross-speaker emotion transfer in speech synthesis

How can AI speech be made richer and more expressive in human emotion? Mobvoi addresses this in a paper titled "Cross-speaker Emotion Transfer Based on Prosody Compensation for End-to-End Speech Synthesis."

Cross-speaker emotion transfer speech synthesis transfers emotion from a source speaker, for whom emotional data exists, to a target speaker who has none, so that the target speaker can express emotions absent from their own training data. Emotion transfer is the most popular strategy in this cross-speaker scenario, and it is crucial to extract a speaker-independent emotion embedding from the source speaker's emotional reference audio: otherwise, residual speaker information in the emotion embedding will contaminate the target speaker's timbre. However, removing the source speaker's timbre tends to weaken the emotional information the embedding carries, so the synthesized emotional speech of the target speaker sounds flat.
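A widely used way to obtain such a speaker-independent emotion embedding is adversarial speaker classification through a gradient-reversal layer. The PyTorch sketch below illustrates that general technique only; the GRU encoder, the two classification heads, and all layer sizes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so training a speaker classifier through it pushes speaker
    information OUT of the embedding."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out

class EmotionReferenceEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into one emotion embedding and
    trains it to predict emotion but, adversarially, not speaker."""
    def __init__(self, n_mels=80, emb_dim=128, n_emotions=5, n_speakers=10):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.emotion_head = nn.Linear(emb_dim, n_emotions)  # keeps emotion in
        self.speaker_head = nn.Linear(emb_dim, n_speakers)  # drives speaker out

    def forward(self, ref_mel):                    # ref_mel: (B, T, n_mels)
        _, h = self.rnn(ref_mel)                   # h: (1, B, emb_dim)
        emo_emb = h.squeeze(0)                     # (B, emb_dim)
        emo_logits = self.emotion_head(emo_emb)
        spk_logits = self.speaker_head(GradReverse.apply(emo_emb))
        return emo_emb, emo_logits, spk_logits

# Minimizing cross-entropy on BOTH heads makes the encoder useful for emotion
# yet, through the reversed gradient, useless for speaker identification;
# this scrubbing is also why the embedding's emotional detail gets weakened.
enc = EmotionReferenceEncoder()
emo_emb, emo_logits, spk_logits = enc(torch.randn(4, 200, 80))
```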

How to keep the emotional information in the emotion embedding from being weakened is therefore the challenge. Specifically, reference embeddings that retain enough emotional information tend to leak the source speaker's timbre into the synthesized speech, while scrubbing speaker information further out of the reference embedding weakens the transferred emotional expression. To address this trade-off, Mobvoi's paper proposes a prosody compensation strategy that restores the emotional information lost when speaker information is removed from the emotion embedding, improving the emotional expressiveness of the synthesized speech.


The paper observes that the hidden representations produced by a pre-trained automatic speech recognition (ASR) model retain some prosodic information while carrying little obvious speaker information. This motivates a prosody compensation module (PCM) that takes the ASR model's intermediate features (AIF) of the reference audio as input to compensate for the lost emotional information. The proposed cross-speaker emotional speech synthesis model with prosody compensation comprises a speaker disentangling module (SDM), a speaker embedding module, and the PCM. The SDM extracts a speaker-independent emotion embedding from the reference spectrogram, while the PCM draws additional emotional information from the AIF to compensate for what disentangling the speaker's timbre removes. To extract global prosody information from the AIF effectively, a prosody compensation encoder assisted by a global context (GC) block (shown in Figure 2 of the paper) is also introduced. Experiments show that this method effectively alleviates the loss of emotional expressiveness in the disentangled emotion embedding, preserving the target speaker's timbre while improving the expressiveness of the transferred emotion.
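As a rough illustration of how the PCM and the GC-assisted prosody compensation encoder might fit together, here is a minimal PyTorch sketch. The additive fusion, the mean pooling, and all dimensions are assumptions made for readability; the paper's actual layer choices differ in detail.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GC block: softmax-attention pooling over time yields one global
    descriptor, which is transformed and added back to every frame."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.transform = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                           # x: (B, T, dim)
        w = torch.softmax(self.attn(x), dim=1)      # (B, T, 1), over time
        ctx = (w * x).sum(dim=1, keepdim=True)      # (B, 1, dim)
        return x + self.transform(ctx)              # broadcast over T

class ProsodyCompensationModule(nn.Module):
    """PCM: distills a global prosody vector from the ASR intermediate
    features (AIF) of the reference audio and adds it back to the
    (weakened) speaker-independent emotion embedding."""
    def __init__(self, aif_dim=256, emb_dim=128):
        super().__init__()
        self.proj = nn.Linear(aif_dim, emb_dim)
        self.gc = GlobalContextBlock(emb_dim)       # prosody compensation encoder

    def forward(self, aif, emotion_emb):            # aif: (B, T, aif_dim)
        h = self.gc(self.proj(aif))                 # (B, T, emb_dim)
        prosody = h.mean(dim=1)                     # global prosody vector
        return emotion_emb + prosody                # compensated embedding

pcm = ProsodyCompensationModule()
out = pcm(torch.randn(2, 150, 256), torch.randn(2, 128))  # -> (2, 128)
```

The compensated embedding would then condition the TTS decoder together with the target speaker's embedding, so that the timbre comes from the target speaker while the emotion and prosody come from the reference.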

(Speech synthesis example)

Industry application: "Magic Sound Workshop", the industry's leading AI dubbing tool

In recent years, Mobvoi's accumulated voice technology has matured, and it has gradually polished an AI dubbing product for consumers: "Magic Sound Workshop". Built on MeetVoice, Mobvoi's self-developed speech synthesis system, the product offers accurate pronunciation and smooth prosody, and has become a go-to dubbing tool for short-video creators.

Magic Sound Workshop offers rich dubbing editing features. In a Word-like editor interface, it handles pause adjustment, polyphonic-character disambiguation, multiple speakers, local speed changes, and other all-round edits, and it adds industry-first tuning features such as stress emphasis and drag-to-edit audio, bringing AI dubbing closer to a real human voice.

But how can Magic Sound Workshop use its massive data to combine speakers of different styles with different emotions, so that it offers more speakers with rich emotions and diverse styles? How can speakers' emotions be made more vivid and abundant? These questions sit behind the ultimate product experience that Magic Sound Workshop has always pursued.

Today's speech synthesis systems depend heavily on high-quality corpora whose style and emotion match the target voice. Style/emotion transfer makes the effect of "one person, a thousand voices" achievable; putting this technology into practice will greatly improve the efficiency of building stylized, emotional speech synthesis systems and reduce their construction cost.

To achieve "one person, a thousand voices", Magic Sound Workshop has also developed and shipped "voice conversion", which transfers speaker A's speaking style (rhythm, prosody, and so on) to speaker B: the converted voice has B's timbre and A's rhythm and prosody.
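Conceptually, voice conversion rests on decomposing speech into content, prosody, and timbre, then recombining them. The sketch below shows that decomposition with every encoder stubbed as a tiny network; a production system, presumably including Mobvoi's, would use pretrained content and speaker encoders plus a neural vocoder, none of which appears here.

```python
import torch
import torch.nn as nn

class VoiceConversionSketch(nn.Module):
    """Recombine A's content and prosody with B's timbre (all stubs)."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, dim, batch_first=True)  # WHAT A says
        self.prosody_enc = nn.GRU(n_mels, dim, batch_first=True)  # HOW A says it
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)  # B's timbre
        self.decoder = nn.Linear(3 * dim, n_mels)                 # toy decoder

    def forward(self, mel_a, mel_b):               # each: (B, T, n_mels)
        content, _ = self.content_enc(mel_a)       # frame-level, from A
        prosody, _ = self.prosody_enc(mel_a)       # frame-level, from A
        _, spk = self.speaker_enc(mel_b)           # single vector for B
        spk = spk.squeeze(0).unsqueeze(1).expand_as(content)
        return self.decoder(torch.cat([content, prosody, spk], dim=-1))

vc = VoiceConversionSketch()
converted = vc(torch.randn(1, 300, 80), torch.randn(1, 240, 80))  # B's timbre, A's delivery
```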


(Magic Sound Workshop product interface)

The "sound conversion" of "Moyin Workshop" can realize:

1. When the AI synthesis is flawed, with broken sounds or unclear, thin pronunciation, use this feature to let your AI anchor learn the delivery of another AI anchor, or learn your own reading;

2. When a certain spot needs emphasis but the AI downplays it, try the voice conversion feature so the delivery "knows what matters";

3. When you want to draw out a sound somewhere but the AI reads it short and fast, use voice conversion to get the pacing right;

4. When the AI rendering of a key line falls short (for example, in the golden first 10 seconds of a video, where creators want the dubbing to stand out), try voice conversion: record the line yourself and let a Magic Sound Workshop AI voice carry your expressive performance, making the result more vivid and emotional.

This paper is part of our ongoing exploration. We look forward to Magic Sound Workshop bringing more diverse speakers online, letting everyone become a director of sound and helping the AI dubbing industry flourish.

Going forward, Mobvoi will continue to deepen its R&D in voice and acoustics and gradually bring the results to more products and services, using smarter technology to create a more considerate voice experience that is full of emotion and pronounces words "on demand", making human-machine interaction more natural and bringing AI into more people's daily lives.

Paper: "Cross-speaker Emotion Transfer Based on Prosody Compensation for End-to-End Speech Synthesis"

Authors: Li Tao, Wang Xinsheng, Xie Qicong, Wang Zhichao, Jiang Mingqi, Xie Lei
