Recently, there has been a rising tide of concern surrounding voice manipulation software. In short, this technology allows a person to take recordings of someone’s speech and create new utterances that the individual may or may not have actually said. Synthetic speech technology has been around for decades, but the entry of a new player into this space, or the publication of a research paper on the topic, tends to create a frenzy of excitement – and anxiety – about the implications of such capabilities for our society. It may interest you to know that, so far this year, over 1 billion voice biometric verifications have been performed, and not a single synthetic speech attack has been successful.
So what is all of the fuss about?
Just like there is photo and video editing software that enables people to create images and videos that blend reality with fiction, voice manipulation software can be used for the same purpose. Have you ever been fooled by a manipulated photograph? I personally like this edited picture of a killer whale attacking a bear. Does it look real? Absolutely. Did it even happen? Clearly not.
Credit: Heart of Vancouver Island
Most of us have been fooled a few times by fake pictures and videos, and have learned that any picture or video can be easily manipulated today. I would argue, in fact, that this happens every day in most lifestyle magazines, where photo retouching has ensured that most of the photographs we see in print are altered versions of the true image. Seeing is not always believing. Manipulating videos has become just as easy; just have a look at this video of a hawk supposedly dropping a snake on a family BBQ. A few decades ago, this video could have fooled millions. In the 21st century, most of us know better.
So it should come as no surprise that voice manipulation is possible as well. What can one do with software tools that are readily available? For instance, you can create an audio file of someone supposedly speaking words that they never actually uttered. All you need is about 20 minutes of net speech (actual talking, with silences removed) from the person whose voice you’d like to manipulate. With voice synthesis, you can then create an audio file of the person saying virtually anything. As with photo and video manipulation, it works best and is most convincing when you take a real phrase and change a word or two, rather than synthesizing an entire sentence, which would be easy to detect as forged. For example, if you have a recording of someone saying, “I love coffee,” manipulating the audio to say, “I love Lucy,” will be more convincing than creating a brand-new sentence such as, “The sky is blue.”
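To make the splicing idea concrete, here is a toy sketch of my own (not any vendor's tool), using sine tones as stand-ins for speech snippets. A naive cut-and-paste edit leaves a sample-level discontinuity at the edit point, exactly the kind of telltale artifact that makes crude manipulations detectable:

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; a common rate for telephony-grade speech

def tone(freq_hz, duration_s, phase=0.0):
    """A sine tone standing in for a short snippet of recorded speech."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t + phase)

# Two "words" cut from different recordings: same pitch, mismatched phase.
word_a = tone(220.0, 0.5)
word_b = tone(220.0, 0.5, phase=np.pi / 2)

# Naive splice: butt the snippets together with no crossfade or alignment.
spliced = np.concatenate([word_a, word_b])

# The sample-to-sample jump is largest exactly at the edit point, producing
# a click-like artifact that a detector (or a careful ear) can find.
jumps = np.abs(np.diff(spliced))
splice_index = len(word_a) - 1
print(jumps.argmax() == splice_index)  # → True
```

Real editing tools smooth the cut with crossfades and pitch alignment, which is why detecting a well-made splice requires more than looking for a single jump – but some trace of the edit remains in the signal.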
Ethical and security implications of voice manipulation
As you can start to imagine, voice manipulation can be used for nefarious purposes. Creating audio recordings of individuals speaking sentences that they never actually spoke could be used to discredit someone’s reputation, or worse, implicate someone in a criminal act that they didn’t actually commit. So, does voice manipulation have ethical implications? Yes, it does, just like photo and video manipulation.
Beyond ethical concerns, voice manipulation has raised security concerns as well. With the growing popularity of voice biometrics, deployed by banks, telecom providers, insurance companies, and government agencies, questions have arisen regarding how voice manipulation could potentially be used to defeat voice biometric security layers. A research paper published by the University of Alabama raised this specific concern.
Drivers for biometrics adoption
Organizations have been using biometrics, including voice biometrics, for many reasons. Voice biometrics today allows consumers to log in (or authenticate) to mobile apps without having to type a password or PIN. Simply speaking a short passphrase such as, “My voice is my password,” can validate a person’s identity with a high degree of confidence. The same technology is also used to authenticate customers in contact centers, eliminating the need for hard-to-remember PINs or, worse, a series of security questions such as, “What was the name of your best childhood friend?” or, “What was your most recent transaction?” – which can be very tricky to answer if the purpose of your call is to find out details about your account!
As you can imagine, a primary driver for organizations to deploy biometrics is to improve the customer experience by moving on from outdated authentication methods. Biometrics does this by reducing the time it takes to authenticate and, more importantly, by reducing authentication failures. We all hate to fail, and we also hate to waste our time, and organizations know that eliminating these two irritants has significant benefits for customer retention and overall satisfaction. Beyond improving the customer experience, however, a close second driver for implementing biometrics has been to improve security over legacy authentication methods and drive down fraud losses. Here is where voice manipulation software creates a question mark: does it undermine these security benefits?
Using biometrics increases security
While no security technology is impenetrable, and biometrics is no exception, real-world experience has shown that the technology can effectively detect and prevent voice manipulation attacks that use voice synthesis. Opus Research, a leading analyst firm that follows the industry closely, wrote about this very topic back in 2015, pointing out that voice biometric technologies have anti-spoofing mechanisms to detect attacks based on voice recordings as well as voice synthesis. Wired covered a similar topic in 2016, noting that even the best impersonators can’t fool a voice biometrics system.
Voice manipulation as I’ve described it above is a combination of voice recordings and voice synthesis, which can both be detected due to the audio artifacts that each of the processes generate. To the human ear, a poor recording or a clunky voice synthesis can be easily detected. You may not be able to describe why, but you are able to tell that the voice quality doesn’t sound right, or the voice sounds artificial in some way. Anti-spoofing algorithms operate in a similar fashion, but are more accurate than the human ear. They can pick-up minute audio discrepancies that are caused by recordings or voice synthesis that to the human ear are undetectable.
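Commercial anti-spoofing engines rely on trained models whose internals aren’t public, so as a rough illustration of the “audio discrepancy” idea, here is a minimal heuristic of my own invention (all function names and thresholds are assumptions, not any product’s API). It flags audio whose high-frequency band is suspiciously empty – a common side effect of replaying a recording through a narrowband channel:

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz

def high_band_energy_ratio(signal, cutoff_hz=4_000):
    """Fraction of spectral energy above cutoff_hz (a crude artifact cue)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
    return spectrum[freqs >= cutoff_hz].sum() / spectrum.sum()

def looks_band_limited(signal, cutoff_hz=4_000, threshold=0.05):
    """Flag audio whose high band is suspiciously empty - a cheap heuristic
    for a recording replayed through a narrowband channel."""
    return high_band_energy_ratio(signal, cutoff_hz) < threshold

rng = np.random.default_rng(0)
live = rng.standard_normal(SAMPLE_RATE)  # broadband stand-in for live speech

# Simulate a narrowband replay by removing everything above 3.4 kHz.
spectrum = np.fft.rfft(live)
freqs = np.fft.rfftfreq(len(live), d=1.0 / SAMPLE_RATE)
spectrum[freqs > 3_400] = 0
replayed = np.fft.irfft(spectrum, n=len(live))

print(looks_band_limited(live))      # False: plenty of high-band energy
print(looks_band_limited(replayed))  # True: high band stripped by the channel
```

Production systems look at many such cues at once, and at far subtler ones, but the principle is the same: recordings and synthesizers leave measurable fingerprints in the signal.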
You can see a presentation outlining voice biometrics vs. synthetic speech below from a recent Nuance customer conference that I attended. Additionally, a colleague of mine this year shared his thoughts about other myths surrounding voice biometrics in a blog post here.
Protect against synthetic speech attacks and decrease fraud losses with biometrics
Organizations have been using voice biometrics to secure banking accounts and confidential data since 2001. As I mentioned earlier in this article, over 1 billion voice biometric verifications have been performed, and not a single synthetic speech attack has been successful. This is a testament to the effectiveness of the anti-spoofing capabilities that protect these systems. Furthermore, many organizations have reported significant reductions in fraud losses following the deployment of biometrics. This shows that the technology fraudsters have readily available does not give them a simple way to bypass biometric security systems.
Clearly, academia and the industry at large need to stay vigilant to ensure that anti-spoofing techniques stay ahead of voice manipulation capabilities. As with all forms of security, it is imperative that we stay one step ahead of those who seek to commit crimes, through continuous effort and innovation.
For the latest news in biometrics, click here.