This is not Morgan Freeman. But if you weren't told, how would you know?
Imagine the following scenario. A phone rings. An office worker answers it and hears his boss, in a panic, tell him that she forgot to transfer money to the new contractor before leaving for the day and needs him to do it. She gives him the wire transfer information, and with the money transferred, the crisis has been averted.
The worker sits back in his chair, takes a deep breath, and watches as his boss walks in the door. The voice on the other end of the call was not his boss. In fact, it wasn't even human. The voice he heard was that of an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.
Attacks using recorded audio like this have already occurred, and conversational audio deepfakes may not be far off.
In recent years, the development of sophisticated machine learning technologies has made deepfakes, both audio and video, possible. Deepfakes bring new uncertainty to digital media. To detect deepfakes, many researchers have turned to analyzing the visual artifacts found in video deepfakes: minute glitches and inconsistencies.
Audio deepfakes potentially pose an even greater threat, because people often communicate verbally without video, for example over the phone, on the radio, or in voice recordings. These voice-only communications greatly expand the opportunities for attackers to use deepfakes.
To detect audio deepfakes, we and our research colleagues at the University of Florida have developed a technique that measures the acoustic and fluid dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.
Organic and synthetic sounds
Humans vocalize by forcing air over the various structures of the vocal tract, including the vocal folds, tongue and lips. By rearranging these structures, you alter the acoustic properties of your vocal tract, allowing you to create more than 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these different phonemes, and as a result the range of correct sounds for each phoneme is relatively small.
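To make that idea concrete: a vowel's identity is largely carried by its first two resonant frequencies, or formants, and human anatomy keeps those formants inside fairly narrow windows. The minimal sketch below uses rounded, illustrative frequency ranges of our own choosing (real values vary with speaker, sex and dialect) to show what checking a measurement against such a window could look like; it is not part of our published detector.

```python
# Illustrative only: rounded, assumed F1/F2 ranges (in Hz) for a few English
# vowels spoken by adults. These placeholder numbers simply show the idea that
# human anatomy keeps each phoneme inside a fairly narrow acoustic window.
VOWEL_FORMANT_RANGES = {
    "iy": {"F1": (200, 400), "F2": (2000, 2800)},   # as in "beet"
    "aa": {"F1": (650, 950), "F2": (1000, 1400)},   # as in "father"
    "uw": {"F1": (250, 450), "F2": (600, 1200)},    # as in "boot"
}

def formants_plausibly_human(phoneme: str, f1: float, f2: float) -> bool:
    """Return True if the measured formants fall inside the assumed human range."""
    lo1, hi1 = VOWEL_FORMANT_RANGES[phoneme]["F1"]
    lo2, hi2 = VOWEL_FORMANT_RANGES[phoneme]["F2"]
    return lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2

print(formants_plausibly_human("aa", 780, 1200))   # True: a typical "father" vowel
print(formants_plausibly_human("aa", 780, 2600))   # False: outside the assumed window
```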
In contrast, audio deepfakes are created by first letting a computer listen to audio recordings of the targeted victim speaker. Depending on the exact technique used, the computer might need to listen to as little as 10 to 20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim's voice.
The attacker then selects a phrase for the deepfake to speak and, using a modified text-to-speech algorithm, generates an audio sample that sounds like the victim saying the selected phrase. This process of creating a single deepfaked audio sample can be accomplished in a matter of seconds, potentially giving the attacker enough flexibility to use the deepfaked voice in a conversation.
Detecting audio deepfakes
The first step in differentiating speech produced by humans from speech generated by deepfakes is understanding how to acoustically model the vocal tract. Luckily, scientists have techniques to estimate what someone, or some being such as a dinosaur, would sound like based on anatomical measurements of their vocal tract.
We did the opposite. By inverting many of these same techniques, we were able to extract an approximation of a speaker's vocal tract during a segment of speech. This let us effectively peer into the anatomy of the speaker who created the audio sample.
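Our published approach relies on fluid dynamic modeling, and its details are beyond the scope of this article. But as a rough illustration of what "inverting" an acoustic model can look like, here is a minimal sketch built on the classical linear-predictive-coding (LPC) view of speech, in which the model's reflection coefficients can be read as the junctions of a concatenated acoustic-tube approximation of the vocal tract. The function names, model order and toy signal below are our own illustrative choices, not our actual pipeline.

```python
import numpy as np
from scipy.signal import lfilter

def reflection_coefficients(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Levinson-Durbin recursion on a frame's autocorrelation.

    Returns the LPC reflection (PARCOR) coefficients, which classically
    correspond to the junctions of a lossless acoustic-tube model.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    a = np.zeros(order + 1)   # prediction coefficients (a[0] unused)
    k = np.zeros(order)       # reflection coefficients
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k[m - 1] = acc / err
        a_prev = a.copy()
        a[m] = k[m - 1]
        a[1:m] = a_prev[1:m] - k[m - 1] * a_prev[m - 1:0:-1]
        err *= 1.0 - k[m - 1] ** 2
    return k

def area_function(k: np.ndarray, end_area: float = 1.0) -> np.ndarray:
    """Turn reflection coefficients into relative tube cross-sectional areas.

    Sign and direction conventions differ between textbooks, so this recovers
    the tract shape only up to reversal and scaling, which is enough here.
    """
    areas = [end_area]
    for km in k:
        areas.append(areas[-1] * (1.0 + km) / (1.0 - km))
    return np.array(areas)

# Toy usage: white noise pushed through one crude resonance stands in for
# roughly 30 ms of speech sampled at 16 kHz.
rng = np.random.default_rng(0)
excitation = rng.normal(size=480)
frame = lfilter([1.0], [1.0, -1.3, 0.9], excitation) * np.hamming(480)
print(np.round(area_function(reflection_coefficients(frame)), 2))
```

The printed vector is a crude profile of relative cross-sectional areas along a tube that would produce similar acoustics, which is the kind of "anatomy estimate" the comparison below refers to.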
Deepfaked audio often results in vocal tract reconstructions that resemble drinking straws rather than biological vocal tracts.
From here, we hypothesized that deepfake audio samples would not be bound by the same anatomical limitations humans have. In other words, analyzing a deepfaked audio sample would simulate vocal tract shapes that do not exist in people.
Our testing results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimates from deepfake audio, we found that the estimates were often incorrect. For instance, it was common for deepfaked audio to result in vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape.
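As a very simplified illustration of how such an estimate could be turned into a flag, one could measure how much the estimated cross-sectional areas vary along the tract and treat nearly uniform, straw-like profiles as suspicious. The sketch below is not our published detector, and its threshold is a made-up value for illustration; a real system would be calibrated on labeled human and deepfaked speech.

```python
import numpy as np

# Hypothetical threshold chosen only for illustration.
UNIFORMITY_THRESHOLD = 0.15

def looks_straw_like(areas: np.ndarray) -> bool:
    """Flag an estimated vocal tract whose cross-sectional areas barely vary.

    `areas` is a vector of relative cross-sectional areas along the tract,
    for example the output of area_function() in the sketch above. Human
    tracts tend to show large section-to-section variation (wide pharynx,
    narrow constrictions); a nearly constant profile is closer to a straw
    than to an airway.
    """
    variation = np.std(areas) / np.mean(areas)   # coefficient of variation
    return variation < UNIFORMITY_THRESHOLD

print(looks_straw_like(np.array([1.0, 1.02, 0.98, 1.01, 0.99])))  # True: straw-like
print(looks_straw_like(np.array([1.0, 2.5, 0.6, 3.0, 1.2])))      # False: human-like variation
```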
This realization demonstrates that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it is possible to identify whether the audio was generated by a person or by a computer.
Why this matters
Today's world is defined by the digital exchange of media and information. Everything, from news to entertainment to conversations with loved ones, usually happens through digital communication. Even in their early days, deepfake videos and audio can undermine confidence in these communications, effectively limiting their usefulness.
If the digital world is to remain a key resource for information in people's lives, effective techniques for determining the source of an audio sample are crucial.