Just be yourself: More on variation, voice biometrics, and the science of voice technology

Variation is inherent in language and comes in many forms, from background noise in the physical environment to a raspy voice when the speaker has the flu. Two further kinds of variation play opposite roles in ASR and Speaker Verification (SV): in ASR you try to eliminate the speaker’s characteristics in order to determine the content, while in SV you try to eliminate the content in order to determine the speaker’s identity. But as with ASR, we’ve found it can actually be beneficial to add that content variation back in for SV, something done with Deep Neural Networks.
By Nils Lenke
Variation can improve accuracy of speaker verification for voice biometrics

Recently, I shared some thoughts on variation in ASR and TTS, and naturally as a speech scientist, I have more to say on the topic of variation.

In language identification, we disregard every source of variation except the language in which a message is spoken; the only goal is to classify bits of speech into one of several languages, based on their specific characteristics. My colleagues who work on Voicemail-To-Text transcription use this to detect voicemails in a foreign language, because these need to be routed to transcription systems specialized for those languages (see the right side of the image below).
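To make the routing idea concrete, here is a minimal sketch in Python. It assumes a language-identification classifier has already produced one score per language; the routing table, the function name `route_voicemail`, the system names, and the `min_score` threshold are all hypothetical and only for illustration, not a description of the actual Voicemail-To-Text setup.

```python
from typing import Dict

# Hypothetical routing table: language code -> transcription system name.
TRANSCRIBERS: Dict[str, str] = {
    "en": "english-transcriber",
    "es": "spanish-transcriber",
    "de": "german-transcriber",
}


def route_voicemail(language_scores: Dict[str, float],
                    default_language: str = "en",
                    min_score: float = 0.5) -> str:
    """Pick the transcription system for the most likely language.

    language_scores maps language codes to classifier scores in [0, 1];
    how those scores are produced is outside this sketch.
    """
    best_language = max(language_scores, key=language_scores.get)
    # Fall back to the default system when the classifier is unsure
    # or the detected language has no dedicated transcriber.
    unsure = language_scores[best_language] < min_score
    if unsure or best_language not in TRANSCRIBERS:
        best_language = default_language
    return TRANSCRIBERS[best_language]
```

Everything except the language decision is ignored here, which mirrors the point above: for this task, speaker and content are just variation to be looked past.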

[Image: add-remove-speaker-variation]

When using voice biometrics (VB) for tasks like speaker identification and speaker verification (see the left side of the image above), we try to eliminate the same environmental variation (noise, channel, etc.) as in automatic speech recognition (ASR). For the other two dimensions, though, the picture is complementary: while ASR eliminates speaker variation to interpret the content (what is said), VB tries to keep the content constant to determine who said it. This is explicit in text-dependent VB, where the speaker is asked to always say the same text. So VB tries to discern the speaker’s identity, but it still tries to eliminate finer-grained aspects attached to the speaker. For example, it tries to be immune to the aging of the speaker and their voice, and to health conditions (e.g., you still want to identify a speaker who has the flu). As my colleague Kevin Farrell puts it, “ASR is trying to get rid of the information needed for VB, while VB is trying to get rid of the information needed for ASR.”

He also described to me how a recent trend has reversed this a bit. The main technique in modern-day VB systems is the identity vector, or ‘iVector,’ which reduces the information in a speech sample to (typically) 400 floating-point values that are specific to an individual and, in general, independent of the text content of the speech sample. What we’ve learned, though, is that variation due to textual content can actually be beneficial to VB accuracy; one way to bring it back in is through Deep Neural Networks (DNNs).
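Once each speech sample has been reduced to a fixed-length vector, comparing two samples becomes a comparison of two vectors. The sketch below uses cosine similarity with a decision threshold, which is one common way to score such vectors; the post does not say which scoring method the products use, so treat the function names, the threshold of 0.5, and the synthetic vectors as assumptions made purely for illustration.

```python
import math
import random
from typing import Sequence

VECTOR_SIZE = 400  # the typical vector size mentioned above


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_a * norm_b)


def verify(enrolled: Sequence[float], test: Sequence[float],
           threshold: float = 0.5) -> bool:
    """Accept the identity claim if the two vectors are similar enough.

    The threshold is an illustrative assumption; a real system would
    tune it on held-out data.
    """
    return cosine_score(enrolled, test) >= threshold


# Synthetic stand-ins for real vectors (assumption: random data only).
rng = random.Random(0)
enrolled = [rng.gauss(0.0, 1.0) for _ in range(VECTOR_SIZE)]
# Same speaker: the enrolled vector plus a little session noise.
same_speaker = [x + rng.gauss(0.0, 0.3) for x in enrolled]
# Different speaker: an unrelated random vector.
other_speaker = [rng.gauss(0.0, 1.0) for _ in range(VECTOR_SIZE)]

same_accepted = verify(enrolled, same_speaker)
other_accepted = verify(enrolled, other_speaker)
```

Note that nothing in this comparison knows what was said: the two vectors are scored as wholes, which is exactly the content-independence described above.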

Machine learning based on DNNs allows us to perform higher-resolution VB modeling by providing information about the phonetic content of an utterance. Hence, when performing a VB comparison between two speech samples, this phonetic information is used in addition to the acoustic information, leading to a sizable improvement in accuracy (you can experience it in the recently launched voice biometrics 10.0 and NVSL 5.0 products). In short, while one milestone technology, the iVector, essentially removed content variation, the next-generation technology, DNNs, has led to improvements by adding it back in.
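As a rough illustration of why phonetic information can help, the toy sketch below compares two utterances sound by sound instead of as a whole. It assumes each frame comes with acoustic features plus phonetic-class posteriors of the kind a DNN could supply; the data layout, the function names, and the per-class cosine scoring are hypothetical simplifications, not the method used in the products.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One frame: (acoustic features, phonetic-class posteriors from a DNN).
Frame = Tuple[Sequence[float], Dict[str, float]]


def per_class_means(frames: List[Frame]) -> Dict[str, Tuple[float, List[float]]]:
    """Return {class: (occupancy, posterior-weighted mean feature vector)}."""
    occupancy: Dict[str, float] = {}
    sums: Dict[str, List[float]] = {}
    for features, posteriors in frames:
        for phone, weight in posteriors.items():
            if weight <= 0.0:
                continue
            occupancy[phone] = occupancy.get(phone, 0.0) + weight
            acc = sums.setdefault(phone, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += weight * value
    return {p: (occupancy[p], [s / occupancy[p] for s in sums[p]])
            for p in occupancy}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_aware_score(frames_a: List[Frame], frames_b: List[Frame]) -> float:
    """Compare two utterances class by class, weighted by shared occupancy."""
    stats_a = per_class_means(frames_a)
    stats_b = per_class_means(frames_b)
    score = 0.0
    total_weight = 0.0
    for phone in set(stats_a) & set(stats_b):
        occ_a, mean_a = stats_a[phone]
        occ_b, mean_b = stats_b[phone]
        weight = min(occ_a, occ_b)
        score += weight * cosine(mean_a, mean_b)
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("utterances share no phonetic classes")
    return score / total_weight


# Toy data (made up): two utterances from speaker A, one from speaker B.
utt_a1 = [([1.0, 0.2], {"aa": 1.0}), ([0.1, 1.0], {"s": 1.0})]
utt_a2 = [([0.9, 0.3], {"aa": 0.8, "s": 0.2}), ([0.2, 1.1], {"s": 1.0})]
utt_b = [([0.1, 1.0], {"aa": 1.0}), ([1.0, 0.2], {"s": 1.0})]

same_score = content_aware_score(utt_a1, utt_a2)
diff_score = content_aware_score(utt_a1, utt_b)
```

In this made-up data, speaker B’s “aa” happens to look like speaker A’s “s” and vice versa, so averaging all frames regardless of content would make `utt_a1` and `utt_b` look identical. Comparing like with like, sound by sound, is what tells the two speakers apart.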

 


Nils Lenke

About Nils Lenke

Nils joined Nuance in 2003, after holding various roles at Philips Speech Processing for nearly a decade. Nils oversees the coordination of various research initiatives and activities across many of Nuance’s business units. He also organizes Nuance’s internal research conferences and coordinates Nuance’s ties to academia and other research partners, most notably IBM. Nils attended the Universities of Bonn, Koblenz, Duisburg and Hagen, where he earned an M.A. in Communication Research, a Diploma in Computer Science, a Ph.D. in Computational Linguistics, and an M.Sc. in Environmental Sciences. Nils can speak six languages, including his mother tongue, German, and a little Russian and Mandarin. In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.