Grand Theft Vocal — How E-Criminals Can “Borrow” Your Voice

Pickpocketing, burglary, hijacking… This repertoire of crime tools feels a bit outdated. Miscreants don’t limit their arsenal to crowbars and glass-cutters anymore. They can also use your own anatomy, according to the Liveness wiki, without you even noticing. And a perfect master key that can unlock almost all doors sits right where your windpipe and vocal cords meet.

Centuries ago, Photius, a Byzantine scholar, wrote about a macabre creature named the crocotta. Living in the terra incognita of Aethiopia, this beast would stalk a hunter unfortunate enough to wander off into the wilderness at twilight. Then the crocotta would mimic the voice of his beloved wife, mother, or child to lure him in and dine handsomely on his flesh.

Crocottas aren’t real. But voice stealing is. Today, with the help of neural networks, criminals can nick your speech and fool your family, coworkers, or a security system — robots can be gullible too, at times.

Neural Network — The Real Perpetrator?

You’ve probably heard of deep learning and artificial intelligence plenty of times. These guys are quite helpful when it comes to processing huge volumes of data, as well as manipulating some of it.

Originally, they were envisioned in 1943 by a brilliant research duo, Walter Pitts and Warren S. McCulloch, in their paper A Logical Calculus of the Ideas Immanent in Nervous Activity, dedicated to neural processes. And the first, very primitive AI prototype was created in 1958 by Frank Rosenblatt, who taught a bulky 5-ton IBM 704 computer to distinguish two groups of punched cards.

Today, Artificial Neural Networks, or ANNs, are so sophisticated that you would probably get light vertigo from trying to understand their complex architecture. They can predict events (almost like the Pythia), diagnose diseases, analyze chemical substances, and even catch online fraudsters red-handed.

What they also can do is be creative. You probably know the MyHeritage app, a mobile application that can restore old photos and even animate them. It’s based on a type of Generative Adversarial Network (GAN).

It was trained to “guess” the color of the polka-dot dress your granny wore 50 years ago. This “guessing” helps antique photo portraits become HD and even smile at you. The same type of wizardry can be applied to sound, too.

Alchemy Behind the Voice Theft  

Voice is merely a bunch of acoustic waves that our vocal cords produce. And like ocean waves, they go up and down, reflecting our timbre, intonations, accent, and even speech impediments.

You’ve probably seen a sound waveform or spectrogram: it resembles the “drawing” you make when testing a pen. This drawing can be replicated. Sound replication is called synthesis, and we should thank it for the lush and dreamy synth soundtracks of the 80s.
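
To make that drawing concrete, here is a minimal sketch of how a spectrogram is computed, assuming Python with NumPy and SciPy; the synthetic three-tone signal merely stands in for a real voice recording:

```python
import numpy as np
from scipy.signal import spectrogram

# One second of a synthetic "voice-like" signal: a 220 Hz fundamental
# plus two quieter overtones, sampled at a typical speech rate.
fs = 16_000  # sample rate in Hz
t = np.linspace(0, 1.0, fs, endpoint=False)
wave = (np.sin(2 * np.pi * 220 * t)
        + 0.5 * np.sin(2 * np.pi * 440 * t)
        + 0.25 * np.sin(2 * np.pi * 660 * t))

# Slice the waveform into short overlapping windows and Fourier-transform
# each one: the result is a grid of "how loud is each frequency over time".
freqs, times, Sxx = spectrogram(wave, fs=fs, nperseg=512, noverlap=256)

print(Sxx.shape)  # (frequency bins, time frames), here (257, 61)
```

Every column of that grid is a snapshot of which frequencies are sounding at a given instant, and it’s precisely this time-frequency picture that cloning systems learn to read and redraw.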

What we can’t thank it for is the voice cloning of today. Here is how it works in general terms:

Step 1. Sample Gathering

Even a good impersonator can spend a few days studying his target’s speech patterns, intonations, and so on. A neural network has to study a lot too. Con artists will gather tons of audio samples of your voice if they have to. 

And it can be done in many ways: recording a phone conversation with you, saving your voice messages in WhatsApp, extracting the audio from an Instagram clip in which you wish your mom a happy birthday, and so on.

Step 2. Training

Once the audio samples are in hand, they will be fed to the sample-hungry neural network. The training involves stages such as speaker encoding, synthesizing, and vocoding.

Speaker encoding is, basically, the core training. The neural model learns to identify the target’s voice and its unique characteristics: huskiness, lisping, timbre dynamics, and so on, even when it’s blended with background noise or the voices of other people.
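
To give a feel for what the encoder produces, here is a tiny sketch assuming a hypothetical encoder that outputs fixed-size embedding vectors (the vectors below are fabricated for illustration): utterances by the same speaker should land close together, which is typically measured with cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the two speaker embeddings point the same way; ~0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Pretend these came out of a trained speaker encoder (they did not):
# two utterances by the same person differ only by a little noise,
# while another person maps to an unrelated vector.
target_voice = rng.standard_normal(256)
same_speaker = target_voice + 0.1 * rng.standard_normal(256)
other_speaker = rng.standard_normal(256)

print(cosine_similarity(target_voice, same_speaker))   # close to 1.0
print(cosine_similarity(target_voice, other_speaker))  # close to 0.0
```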

Synthesizing is in charge of generating spectrograms based on the phonemes, from A to Z, uttered by the target. It’s like a Speak & Spell toy, only humanized with random vectors and other intricacies.

And vocoding finalizes the job. It pieces together a fake but realistic voice with the help of a tool such as an autoregressive model.
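
Put together, the three stages form a pipeline: reference audio goes into the encoder, the resulting voiceprint conditions the synthesizer, and the vocoder renders the final waveform. Below is a deliberately dumbed-down Python sketch of that data flow. Every function is a stand-in stub with made-up shapes; a real system, such as one from the SV2TTS family, would put a trained neural network in each slot.

```python
import numpy as np

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    """Stage 1 stub: squeeze a reference recording into a fixed-size
    voiceprint (speaker embedding). Here it is just a fake 256-dim vector
    derived from the audio; a real encoder is a trained neural network."""
    seed = int(np.abs(reference_audio).sum() * 1000) % 2**32
    return np.random.default_rng(seed).standard_normal(256)

def synthesizer(text: str, voiceprint: np.ndarray) -> np.ndarray:
    """Stage 2 stub: map text plus voiceprint to a mel spectrogram
    (80 frequency bands by a number of frames that grows with the text)."""
    n_frames = 10 * len(text)
    return np.zeros((80, n_frames)) + 0.01 * voiceprint[:80, None]

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 3 stub: turn the spectrogram back into a raw waveform.
    Real vocoders (e.g. autoregressive models) build it sample by sample;
    here we just emit silence of the right length."""
    hop_length = 256  # waveform samples per spectrogram frame
    return np.zeros(mel.shape[1] * hop_length)

# The overall flow: stolen samples -> voiceprint -> spectrogram -> audio.
stolen_sample = np.random.default_rng(42).standard_normal(16_000)  # 1 s of "victim" audio
voiceprint = speaker_encoder(stolen_sample)
mel = synthesizer("Hi Mel, it's me, Josh.", voiceprint)
fake_wave = vocoder(mel)
print(voiceprint.shape, mel.shape, fake_wave.shape)
```

Even this toy version makes the division of labor clear: only the encoder ever needs the victim’s audio. Once the voiceprint exists, an attacker can synthesize any sentence they like, which is exactly what the next step exploits.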

Step 3. Phishing

When the crime tools are ready, it’s time to go phishing. At this point, anyone can be a target: your family, neighbors, boss, friends, and especially your bank. Fraudsters will simply type the text they want, like “Hi Mel, it’s me, Josh. Would you withdraw $2,000 from my account?”, and the unsuspecting victim won’t even notice that it’s not you. Voice forgeries are pretty realistic.

Crime & No Punishment

If you still think the scenario given above looks like science fiction, think again. Voice cloning is already a widely practiced spoofing attack. To make things even worse, fabricating a voice is easier, cheaper, and more effective than producing a good ole video deepfake.

The first known incident involving a voice deepfake occurred in 2019. Nameless and faceless as usual, fraudsters prepared a highly convincing impersonation of the CEO of a German energy firm. They attacked the firm’s British subsidiary, commanding its local CEO to wire $243,000 “to a Hungarian supplier”.

Befuddled by the sudden call and the “it’s an emergency!” remark from the “boss”, the British CEO promptly fulfilled the request, which probably cost him his job and left a blot on his CV. The money was never recovered.

The second known AI-voice attack pursued a far more ambitious goal and took place the next year. This time, $35 million was at stake. The target was a bank based in Hong Kong. The manager instantly recognized the voice at the other end of the line and was thrilled to learn that “a corporate acquisition” was about to happen. All he needed to do was authorize a $35 million transfer at the request of the customer, who was based in Dubai.

This attack was on a whole different level of orchestration. Criminals bombarded the manager’s inbox with messages from Martin Zellner, a lawyer supposedly hired to oversee the deal. Accompanying emails, allegedly from the Dubai office, reassured the manager even more, finally making him let his guard down. The investigation appears to be ongoing to this day.

Voice Toys

Nonetheless, voice cloning isn’t always nefarious. For example, CyberVoice, a deep-voice tool, was used to replicate Doug Cockle’s voice and breathe new life into a Witcher 3 fan mod. Astonished fans reported on YouTube that they couldn’t tell the synthesized doppelgänger from the real actor. (Probably to Cockle’s dismay.)

Perhaps we are about to witness a peculiar legal precedent in which a human voice is recognized as inalienable intellectual property. Otherwise, hundreds of voice actors risk losing their jobs and laurels.

As every moon has two sides, so do deep learning and AI. We can use them to cure diseases or to give a charismatic baritone to a wandering warlock. But some will always try to use AI as a tool of trickery and deception.

Would you like to know more about AI threats? Check out antispoofing.org and the antispoofing definition Wiki!