A 15-second sample is all it takes to make an AI voice clone

Voice Engine is a voice-cloning tool developed by OpenAI that can create a clone from a short 15-second speech sample. The results are unnervingly accurate. With the tool, users can generate clips of a speaker reading text that sound almost flawlessly like the real person. They can also make the clones speak in different languages, which is one of the tool's intended uses. AI has improved robo text-to-speech significantly. Famous robo voices like Siri, Alexa, or that too-chipper TikTok voice all have the telltale halts, clips, and strange inflections that set them apart from regular human speech. "When-you ARE FIN.ished re-CO-rDing, you may hang-up/or/press POUND. For mo/re optioNS." No one talks like that; it's pretty easy to tell it apart from the real thing.

But now it's much harder to tell the difference between something a real person actually said in full and a sound bite generated to sound like they did. Inflection, pauses, accent, irregularities, and errors are all part of what makes a person's voice distinct. If you listen very, very closely to the clips trained on 15-second samples, you can hear a bit of too-regular cadence. Overall, though, they're pretty much indistinguishable from the enthusiastic voice actor's own theatrical speaking voice.

I do wonder where this will lead the voice acting industry (among other fields). For professional reads on, say, a car insurance ad, regularity and tone are important. AI voice can do this very effectively, but it's (probably) still easier to ask a human to put a little more emphasis on a certain word or to change the emotional feel of the reading. Telling a program to read something "happy" typically results in TikTok's even, bubbly tone, but it's probably a little bit harder to tell Voice Engine to start a read reserved yet excited and interested, then end up relieved and joyful. Eventually, programs like these will become just as good at reading as professionally trained actors, which will lead to legal and ethical realities that are only beginning to be addressed.

At the same time, the US government is trying to curb unethical uses of AI voice technology. Last month, the Federal Communications Commission banned robocalls using AI voices after people received spam calls from an AI-cloned voice of President Joe Biden.

According to OpenAI, its partners agreed to abide by its usage policies, which say they will not use Voice Engine to impersonate people or organizations without their consent. It also requires the partners to get the "explicit and informed consent" of the original speaker, not build ways for individual users to create their own voices, and to disclose to listeners that the voices are AI-generated. OpenAI also added watermarking to the audio clips to trace their origin and actively monitors how the audio is used.

OpenAI suggested several steps that it thinks could limit the risks around tools like these, including phasing out voice-based authentication to access bank accounts, policies to protect the use of people's voices in AI, greater education on AI deepfakes, and development of tracking systems of AI content. 

Emilia David, The Verge

Hopefully these safeguards help out some.

Previously: Amazon product name is an OpenAI error message