Recently, I shared some thoughts on variation in ASR and TTS and, naturally for a speech scientist, I have more to say on the topic.
In language identification, we eliminate every source of variation except the language in which a message is spoken; the only goal is to classify stretches of speech into one of several languages, based on their language-specific characteristics. My colleagues who work on Voicemail-To-Text transcription use this to detect voicemails in a foreign language, because those need to be routed to transcription systems specialized for that language (see the right side of the image below).
When using Voice Biometrics for tasks like speaker identification and speaker verification (see the left side of the image above), we try to eliminate the same environmental variation (noise, channel, etc.) as in automatic speech recognition (ASR), but for the other two dimensions the picture is complementary: while ASR eliminates speaker variation to interpret content (what’s said), voice biometrics (VB) tries to keep the content constant (explicitly so in text-dependent VB, where the speaker is asked to always say the same text) to determine who said it. VB therefore tries to discern the speaker’s identity while still eliminating finer-grained variation attached to the speaker. For example, it tries to be robust to the aging of the speaker’s voice and to health conditions (e.g., you still want to identify a speaker who has the flu). As my colleague Kevin Farrell puts it, “ASR is trying to get rid of the information needed for VB, while VB is trying to get rid of the information needed for ASR.”
He also described to me how a recent trend has reversed this a bit: the main technique in modern-day VB systems is the identity vector, or ‘iVector,’ which reduces the information in a speech sample to (typically) 400 floating-point values that are specific to an individual and, in general, independent of the text content of the speech sample. What we’ve learned, though, is that variation due to textual content can actually be beneficial to VB accuracy; one way to incorporate it back in is through Deep Neural Networks (DNNs).
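To make the idea of comparing two such fixed-length vectors concrete, here is a minimal sketch. It assumes cosine similarity as the comparison score, which is a common choice in the research literature but is not something described above; the function names, the 0.5 threshold, and the synthetic vectors are all hypothetical stand-ins, not output from a real iVector extractor.

```python
import math
import random

DIM = 400  # typical vector size mentioned in the text


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring rule)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_a * norm_b)


def verify(enrolled, test, threshold=0.5):
    """Accept the identity claim if the score reaches a (hypothetical) threshold."""
    return cosine_score(enrolled, test) >= threshold


# Synthetic stand-ins for vectors produced by a real system.
rng = random.Random(0)
speaker_a = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
speaker_b = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
# A second "session" of speaker A: the same vector plus a little noise.
speaker_a_again = [x + rng.gauss(0.0, 0.3) for x in speaker_a]

same_speaker = verify(speaker_a, speaker_a_again)
different_speaker = verify(speaker_a, speaker_b)
```

In this toy setup the two samples of speaker A score close to 1, while two unrelated random vectors score close to 0, so a single threshold separates them. Real systems typically add further normalization and calibration before thresholding.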
Machine learning based on DNNs allows us to perform higher-resolution VB modeling by providing information about the phonetic content of an utterance. Hence, when performing a VB comparison between two speech samples, this phonetic context is used in addition to the acoustic information, leading to a sizable improvement in accuracy (you can experience it in the recently launched voice biometrics 10.0 and NVSL 5.0 products). So, in short, while one milestone technology, the iVector, essentially removed variation due to content, the next-generation technology, DNNs, has led to improvements by adding it back in.
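One way this is often done in the research literature is to let the DNN's per-frame phonetic posteriors decide how much each acoustic frame contributes to per-class statistics, so that speakers are compared sound by sound rather than over one pooled average. The sketch below illustrates only that accumulation step; it is an assumption about the general approach, not a description of how the products mentioned above work, and the function names and toy numbers are hypothetical.

```python
def accumulate_stats(frames, posteriors):
    """Posterior-weighted statistics per phonetic class.

    frames:     list of acoustic feature vectors, one per frame
    posteriors: list of per-frame class probabilities (e.g. from a DNN)
    Returns (zeroth, first): the soft frame count per class and the
    posterior-weighted sum of features per class.
    """
    n_classes = len(posteriors[0])
    dim = len(frames[0])
    zeroth = [0.0] * n_classes
    first = [[0.0] * dim for _ in range(n_classes)]
    for features, gamma in zip(frames, posteriors):
        for c, weight in enumerate(gamma):
            zeroth[c] += weight
            for d, value in enumerate(features):
                first[c][d] += weight * value
    return zeroth, first


def class_means(zeroth, first, floor=1e-8):
    """Average feature vector per phonetic class (floor avoids division by zero)."""
    return [[f / max(n, floor) for f in sums] for n, sums in zip(zeroth, first)]


# Toy utterance: three 2-dimensional frames and two phonetic classes.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
posteriors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments for clarity

zeroth, first = accumulate_stats(frames, posteriors)
means = class_means(zeroth, first)
```

Here the first two frames are attributed to class 0 and the third to class 1, so each class gets its own average. With real, soft posteriors every frame would contribute fractionally to several classes.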