Movies, games, podcasts: which industries need voice deepfakes

Voice cloning hasn’t been around for very long, but the technology is already showing impressive results, and not just in celebrity parodies. Trends looks at who benefits from voice deepfakes, how they are used, and what threats they pose.

A deepfake is a video, audio recording, or photo that appears genuine but is in fact the product of manipulation by artificial intelligence (AI). The term is a blend of “deep learning” and “fake” and gained currency in 2017. The technology behind most deepfakes, the generative adversarial network (GAN), was introduced in 2014 by Ian Goodfellow, then a PhD student and now director of machine learning at Apple’s Special Projects Group.

Deepfakes are produced by generative adversarial networks. Like a person, the algorithm learns from its own mistakes, in effect competing against itself: a second network “penalizes” it for errors and “rewards” it for correct output until the generator produces the most convincing fake it can.
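
The adversarial setup can be illustrated with a deliberately tiny numpy sketch (all distributions, learning rates, and step counts here are arbitrary toy choices, not anything a real deepfake system uses): a one-parameter linear generator tries to produce numbers matching a “real” distribution, while a logistic discriminator penalizes it until the fakes drift toward the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
REAL_MEAN, REAL_STD = 4.0, 0.5   # the "real" data the generator must imitate

# Generator: x_fake = w*z + b with z ~ N(0,1). Discriminator: D(x) = sigmoid(a*x + c).
w, b = 1.0, 0.0
a, c = 0.1, 0.0
lr, batch = 0.05, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(3000):
    x_real = rng.normal(REAL_MEAN, REAL_STD, batch)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = w * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(a * x_real + c), sigmoid(a * x_fake + c)
    grad_a = -np.mean((1 - d_real) * x_real) + np.mean(d_fake * x_fake)
    grad_c = -np.mean(1 - d_real) + np.mean(d_fake)
    a -= lr * grad_a
    c -= lr * grad_c

    # Generator step (non-saturating loss): move fakes to where D says "real".
    d_fake = sigmoid(a * x_fake + c)
    dloss_dx = -(1 - d_fake) * a          # derivative of -log D(x) w.r.t. x
    w -= lr * np.mean(dloss_dx * z)
    b -= lr * np.mean(dloss_dx)

fake_mean = float(np.mean(w * rng.normal(0.0, 1.0, 10_000) + b))
print(f"real mean {REAL_MEAN}, generated mean {fake_mean:.2f}")
```

The generator starts out producing samples centered at 0; the competition with the discriminator drags its output toward the real distribution, which is the whole principle behind GAN-generated voices and faces.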

With the development of AI technology, creating deepfakes is getting easier. To obtain a speech clone, it is enough to record your voice for a while, avoiding slips of the tongue and background noise, and then either send the file to a company that offers cloning as a service or upload it yourself to a dedicated program. Dozens of startups already offer such services, including Resemble, Descript, and CereVoice Me.

A couple of years ago, the most realistic deepfakes were made by recording a person’s voice, splitting the speech into its component sounds, and recombining them into new words. Now neural networks can be trained on speech data of almost any quality and volume, because the adversarial setup forces them to model real speech ever faster and more accurately. Where systems once required tens or even hundreds of hours of audio, realistic voices can now be generated from just a few minutes of material. Companies are moving to commercialize the technology and already offer it in several areas.

For advertising, films and dubbing

Veritone launched its MARVEL.ai service in the spring of 2021 to create and monetize voice deepfakes. The company notes that the technology lets influencers, athletes, and actors license deepfakes of their voices for products such as commercials without ever visiting a studio. Veritone says built-in “watermarks” protect such deepfakes from illegal copying and use.
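
Veritone does not disclose how its watermarking works. Purely as an illustration of the general idea, the sketch below implements a classic spread-spectrum audio watermark (the key, amplitude, and threshold are all invented for the example): a barely audible pseudorandom signal derived from a secret key is added to the audio, and later detected by correlating the audio with the same key.

```python
import numpy as np

def watermark_signal(n_samples: int, key: int) -> np.ndarray:
    """Pseudorandom +/-1 sequence derived from a secret key."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=n_samples)

def embed(audio: np.ndarray, key: int, alpha: float = 0.01) -> np.ndarray:
    """Add a low-amplitude watermark (alpha controls how audible it is)."""
    return audio + alpha * watermark_signal(len(audio), key)

def detect(audio: np.ndarray, key: int) -> float:
    """Correlation with the key sequence: near alpha if watermarked, near 0 if not."""
    return float(np.dot(audio, watermark_signal(len(audio), key)) / len(audio))

# Demo on one second of stand-in "audio" (noise at a 16 kHz sample rate).
rng = np.random.default_rng(42)
clean = rng.normal(0.0, 0.1, 16_000)
marked = embed(clean, key=1234)

score_marked = detect(marked, key=1234)
score_clean = detect(clean, key=1234)
score_wrong_key = detect(marked, key=9999)
print(score_marked, score_clean, score_wrong_key)
```

Only someone holding the correct key gets a detection score near alpha; clean audio, or the wrong key, yields a score near zero, which is what makes such marks usable as proof of origin.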

The voice deepfakes the company creates can be adjusted in tone, switched to a different speaker gender, and translated into other languages.

Microsoft began offering a similar service to partners in early 2021. On the Microsoft Azure AI platform, they can synthesize celebrity voices that are indistinguishable from the live originals. For example, the American telecom company AT&T greets visitors to its experience store in Dallas with the voice of Bugs Bunny: the character addresses each guest by name and keeps up a conversation while they shop. To give Bugs Bunny his voice, an actor recorded phrases for Microsoft.

For podcasts and audiobooks

Voice deepfake technology is built into podcast-editing software developed by the American firm Descript. Its Overdub feature lets a podcaster create an AI clone of their voice so that producers can edit episodes quickly: the feature can not only delete unwanted words but also replace them with new ones. To use it, it is enough to type in the required text, which the cloned voice then “speaks.”

The tool is already used by Pushkin Industries, which works with podcasters and audio storytellers such as Malcolm Gladwell (Revisionist History), Michael Lewis (Against the Rules), and Ibram X. Kendi (Be Antiracist).

Voice deepfake threats

Researchers at the University of Chicago’s SAND Lab tested voice-synthesis programs freely available on the GitHub developer platform. It turned out they can fool the voice-recognition systems of Amazon Alexa, WeChat, and Microsoft Azure.

For example, the SV2TTS program needs only five seconds of audio to create a passable imitation. It fooled the Microsoft Azure bot in about 30% of cases, and in 63% of cases the deepfake went unrecognized by the WeChat and Amazon Alexa voice systems. Among 200 human volunteers, more than half could not tell it was a deepfake.

Researchers see this as a serious threat in terms of fraud, as well as attacks on entire systems. For example, WeChat allows users to sign in to an account with their voice, while Alexa allows them to use voice commands to make payments.
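
Voice login of this kind typically works by comparing a speaker embedding extracted from the incoming audio with one enrolled earlier, accepting the caller if the cosine similarity clears a threshold. The sketch below uses made-up random vectors and an assumed threshold, not any real system’s values, simply to show why a clone whose embedding lands close enough to the target’s passes the same check as the genuine speaker.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify(enrolled: np.ndarray, probe: np.ndarray, threshold: float = 0.8) -> bool:
    """Accept the caller if their embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, probe) >= threshold

rng = np.random.default_rng(0)
enrolled = rng.normal(size=256)                        # target speaker's stored embedding
genuine = enrolled + rng.normal(scale=0.1, size=256)   # same speaker, new recording
clone = enrolled + rng.normal(scale=0.3, size=256)     # good voice clone: nearby embedding
stranger = rng.normal(size=256)                        # unrelated speaker

print(verify(enrolled, genuine), verify(enrolled, clone), verify(enrolled, stranger))
```

A random stranger’s voice is rejected, but a synthetic voice whose embedding sits near the target’s sails through, which is exactly the weakness the SAND Lab experiments exploited.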

Similar stories keep recurring. In 2019, scammers used a voice deepfake to trick the head of a British energy company: convinced that his boss at the German parent company was calling, he transferred more than $240,000 to the fraudsters.

Companies that offer deepfakes as a service do not deny that the technology can be used maliciously, yet they continue to push toward ever more lifelike voices. Lyrebird, for example, which became the AI research division of San Francisco-based Descript, claims it can generate “the world’s most realistic artificial voices”; Descript can create a speech clone from an uploaded recording as short as one minute.

The problem with commercial use of voice deepfakes is that no country in the world recognizes ownership of a person’s voice. The question of protecting the rights of the deceased with respect to the use of their voices also remains open.

In addition, no country yet has legislation governing the removal of deepfakes. The US and China are only beginning to develop laws regulating their use; California, for example, has banned deepfakes in advertising. In Russia, the fight against deepfakes was added to one of the Digital Economy roadmaps in July 2021.

The only exception is when a person’s name is registered as a commercial brand, as is usually the case with celebrities. In 2020, the American YouTube channel Vocal Synthesis posted several humorous AI-generated recordings of rapper Jay-Z’s lyrics without commercial gain, and every video was captioned to say the celebrity’s speech was synthesized. Nevertheless, Roc Nation, the entertainment company owned by Jay-Z, filed a copyright complaint and demanded the videos be taken down. In the end, only two of the four Jay-Z videos were removed: the synthesized audio was recognized as a derivative work that had nothing in common with any of the rapper’s actual songs.

Ethical nuances

Deepfakes can be used for good, but ethical questions remain. The documentary “Roadrunner: A Film About Anthony Bourdain,” about the late chef, was condemned as unethical by critics and audiences alike: its creators used a neural network to generate Bourdain’s voice and had it read lines the chef never actually spoke aloud. Critics, who were unaware of this when they saw the film, accused the filmmakers of deception and of manipulating the audience.

Meanwhile, the startup Sonantic announced that it had created a voice clone for actor Val Kilmer, who has been barely able to speak since undergoing a tracheotomy during his treatment for laryngeal cancer. The company used its own AI model, Voice Engine, and the actor thanked the team.

Sonantic notes that its app lets creative teams type in text and then adjust key parameters of the synthesized speech, including pitch and tempo.

Prospects for the technology

Voice professionals and announcers believe deepfakes can be genuinely useful for mechanical voice work, in messengers, generated ads, and the like, but that they cannot compete with real people where emotion is required. Companies are working on that too: Resemble AI, for example, already offers a form of modulation when creating a deepfake that changes intonation and adds emotion to the speech.

In late 2020, TikTok became the first social network to offer automatic text-to-speech voiceover. The voice soon had to be changed, however: the synthesized female voice turned out to belong to a real person, voice actress Bev Standing, who had previously recorded material for the Chinese Institute of Acoustics. Standing sued TikTok.

Paradoxically, speech deepfakes can also improve security. The startup Modulate is testing “voice skins,” a technology that uses machine-learning algorithms to adjust the sound of a person’s voice so that it resembles someone else’s. To teach the system a range of tones and timbres, the company collected and analyzed recordings of hundreds of actors reading scripts. Modulate says the technology will let people chat safely in game chats and other online voice meetings.
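
Modulate’s voice skins are learned by machine-learning models trained on those actors’ recordings. As a far cruder, non-ML illustration of changing how a voice sounds, the sketch below shifts pitch by naive resampling (the factor, sample rate, and test tone are arbitrary): a 220 Hz tone comes out around 330 Hz.

```python
import numpy as np

def pitch_shift(audio: np.ndarray, factor: float) -> np.ndarray:
    """Naive pitch shift by resampling: factor > 1 raises pitch (and shortens the clip)."""
    old_idx = np.arange(len(audio))
    new_idx = np.arange(0, len(audio), factor)
    return np.interp(new_idx, old_idx, audio)

def dominant_freq(audio: np.ndarray, sample_rate: int) -> float:
    """Frequency of the strongest spectral peak."""
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

SAMPLE_RATE = 16_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 220.0 * t)     # stand-in for a voice: a 220 Hz tone

shifted = pitch_shift(tone, 1.5)         # played back at the same rate: ~330 Hz

print(dominant_freq(tone, SAMPLE_RATE), dominant_freq(shifted, SAMPLE_RATE))
```

Real voice conversion preserves timing and timbre in ways this resampling trick does not, which is why Modulate needs trained models rather than simple signal processing.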
