Cloning your voice using artificial intelligence is both tedious and straightforward – the hallmarks of a technology that’s pretty much mature and ready to go public.
All you have to do is speak into a microphone for about 30 minutes, reading a script as carefully as you can (in my case: the voiceover of a David Attenborough documentary). After starting and stopping dozens of times to re-record your flubs and mumbles, you upload the resulting audio files for processing, and a few hours later you're notified that a copy of your voice is ready and waiting. Then you can type whatever you want into a chat box, and your AI clone will say it back to you, with audio realistic enough to fool even your friends and family, at least for a few moments. The fact that such a service even exists may be news to many, and I don't think we've begun to fully consider the impact that easy access to this technology will have.
Text-to-speech technology has improved dramatically in recent years thanks to advances in machine learning. Previously, the most realistic synthetic voices were created by recording the audio of a human voice actor, chopping their speech into component sounds, and splicing those sounds back together like letters in a ransom note to form new words. Now, neural networks can be trained on unsorted audio of their target voice to generate raw speech from scratch. The end results are faster to produce, easier to make, and more realistic to boot. The quality certainly isn't perfect when the machine's output is used directly (though manual adjustments can improve it), but it will only get better in the near future.
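The older "ransom note" approach described above can be sketched in a toy example. This is a minimal, hypothetical illustration in Python, where short sine tones stand in for the recorded speech units a real concatenative system would cut from hours of a voice actor's sessions; it is not a working synthesizer:

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second (an assumed, typical rate)

def unit(freq_hz, duration_s=0.1):
    """A sine tone standing in for one recorded speech unit (e.g. a diphone)."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# Toy "unit bank": a real concatenative system stores thousands of these,
# segmented from a voice actor's recordings.
UNITS = {
    "h": unit(220.0),
    "e": unit(330.0),
    "l": unit(262.0),
    "o": unit(392.0),
}

def synthesize(phonemes):
    """Stitch prerecorded units end to end, ransom-note style."""
    return np.concatenate([UNITS[p] for p in phonemes])

audio = synthesize(["h", "e", "l", "l", "o"])  # 5 units of 1,600 samples each
```

Neural text-to-speech replaces the unit bank entirely: a trained network generates the waveform itself, which is why modern voice clones no longer need carefully segmented recordings.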
There's no secret sauce behind these clones, which means dozens of startups already offer similar services. Just Google "AI text-to-speech" or "AI voice deepfakes" and you'll see how mundane the technology has become: available from specialist firms that focus solely on speech synthesis, like Resemble.AI and Respeecher, and also integrated into companies with larger platforms, like Veritone (where the technology is part of its advertising repertoire) and Descript (which builds it into its podcast-editing software).
In the past, these voice clones were mostly a novelty, appearing as one-off fakes like a faux Joe Rogan, but they're starting to be used in serious projects. In July, a documentary about chef Anthony Bourdain sparked controversy when its creators revealed they had used AI to generate audio of Bourdain "speaking" lines he had written in a letter. (Notably, few people noticed the deepfake until the creators revealed it existed.) And in August, the startup Sonantic announced it had created an AI voice clone of actor Val Kilmer, whose own voice was damaged in 2014 after he underwent a tracheotomy as part of his treatment for throat cancer. These examples also frame some of the social and ethical dimensions of the technology. The Bourdain use case was decried by many as exploitative (especially since it was not disclosed in the film), while the Kilmer work has been widely praised, with the technology lauded for providing what other solutions could not.
Celebrity voice clones are likely to be the most prominent application over the next few years, with companies betting that famous people will want to boost their income with minimal effort by cloning and renting out their voices. One such company, Veritone, launched a service along these lines earlier this year, saying it would let influencers, athletes, and actors license their AI voices for things like endorsements and radio idents without ever having to enter a studio. "We're really excited about what this means for a multitude of different industries because the hardest part of someone's voice and being able to use it and being able to extend it is the individual's time," Sean King, executive vice president at Veritone One, told The Vergecast. "One person becomes the limiting factor in what we do."
Such applications aren't yet widespread (or if they are, they're not widely discussed), but they seem like an obvious way for celebrities to make money. Bruce Willis, for example, has already licensed his likeness for use as a visual deepfake in mobile phone ads in Russia. The deal lets him earn money without ever leaving the house, while the advertising company gets an infinitely malleable actor (and, notably, a younger version of Willis, straight out of his Die Hard days). These kinds of visual and audio clones could scale up the economics of celebrity work, allowing stars to capitalize on their fame – as long as they're happy to hire out a simulacrum of themselves.
In the here and now, voice-clone technology is already being integrated into tools like the eponymous podcast-editing software built by the American company Descript. The company's "Overdub" feature lets a podcaster create an AI clone of their own voice so producers can make quick changes to their audio, supplementing the program's transcript-based editing. As Descript CEO Andrew Mason told The Vergecast: "You can not only remove words in Descript and have it remove audio, you can type words and it will generate audio in your voice."
When I tried Descript's Overdub feature myself, it was certainly easy enough to use – though as mentioned above, recording the training data was a bit of a chore. (It was much easier for my colleague and regular Verge podcast host Ashley Carman, who had plenty of prerecorded audio ready to feed the AI.) The voice clones Overdub creates are by no means flawless. They have an odd tone and lack the ability to really charge lines with emotion and emphasis, but they are also unmistakably you. The first time I used my voice clone was a genuinely strange moment. I'd had no idea that this deeply personal thing – my voice – could be copied by technology so quickly and easily. It felt like an encounter with the future, but one that was also oddly familiar. After all, life is already full of digital mirrors – avatars and social media feeds meant to embody "you" in various forms – so why not add a talking automaton to the mix?
The initial shock of hearing a voice clone of yourself doesn't mean human voices are now redundant. Far from it. You can certainly improve the quality of vocal deepfakes with a bit of manual editing, but in their automated form they still can't deliver the range of inflection and intonation you get from professionals. As voice actor and narrator Andia Winslow told The Vergecast, while AI voices can be useful for rote voice work – internal messaging systems, automated public announcements, and the like – they can't compete with humans in many use cases. "For the big stuff, the things that need breath and life, it's not going to be like that because, in part, these brands like to work with the celebrities they hire, for example," Winslow said.
But what does this technology mean for the general public – those of us who aren't famous enough to profit from it and aren't professionally threatened by its development? Well, the potential applications are varied. It's not hard to imagine a video game whose character-creation screen includes an option to make a voice clone, so it sounds like the player is speaking all of the game's dialogue. Or an app for parents that lets them copy their voices so they can read bedtime stories to their children even when they're away. Such applications could be built with today's technology, though the poor quality of quickly made clones would make them a hard sell.
There are also potential dangers. Fraudsters have already used voice clones to trick businesses into transferring money into their accounts, and other malicious uses surely lurk just over the horizon. Imagine, for example, a high school student surreptitiously recording a classmate to create a clone of their voice, then faking audio of that person disparaging a teacher to get them into trouble. If the uses of visual deepfakes are anything to go by – where fears about political disinformation have proven largely misplaced, but the technology has done enormous damage by enabling non-consensual pornography – it's these kinds of incidents that pose the biggest threats.
One thing is certain, however: in the future, anyone will be able to create an AI voice clone of themselves if they wish. But the script this chorus of digital voices will follow has yet to be written.