Communication is full of variation and variety. When you’re exposed to an unfamiliar language it is very hard to recognize patterns – and it can be hard even in your native language, because people from different regions, demographics, and dialect groups speak differently, as do speakers in different environments (say, a noisy versus a quiet place) and situations.
In the 19th and 20th centuries, linguists took on this challenge and found order in what looked like chaos by abstracting away from the variation to what is stable and invariant. Thus the notion of the phoneme was born – the smallest linguistic unit that makes a difference in meaning: change one phoneme in a word, and you change its meaning.
Fast forward to today, where the same concept is applied in Automatic Speech Recognition (ASR). ASR must find ways to boil all of this variation down to an invariant canonical form – mostly a stream of phonemes and the words made up of them. ASR uses many techniques that explicitly address the variation of particular aspects of speech, and then applies machine learning on large amounts of data to teach a model to make sense of the remaining variation.
One example is “time warping,” which eliminates differences in speaking rate between speakers. A technique called Vocal Tract Length Normalization (VTLN) does exactly what its name says: it abstracts away from the length of the vocal tract, which differs between individuals, with a predictable impact on the frequency content of the voice signal. Context Dependent (CD) phonemes cope with the variation of phonemes depending on their context. Noise suppression algorithms make the speech signal look as if it had been recorded in a standard silent environment. And so on. The end result is a string of phonemes – a canonical form much reduced in information content compared to the original speech signal.
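To make the time-warping idea concrete, here is a minimal sketch of Dynamic Time Warping (DTW), the classic algorithm behind the name. Modern ASR systems use statistical models rather than template matching, but DTW illustrates the principle: it aligns two feature sequences that differ in speaking rate by finding the cheapest monotonic alignment between them.

```python
def dtw_distance(a, b):
    """Alignment cost between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match step
    return cost[n][m]

# The same "utterance" spoken slowly aligns perfectly with a fast version,
# because the stretch in time is warped away:
slow = [1, 1, 2, 2, 3, 3]
fast = [1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0
```

The nested `min` is what allows one sequence to linger on a frame while the other moves on – exactly the speaking-speed difference the text describes.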
The picture below shows the various factors that influence variation, grouped into three dimensions (causes for variation attached to the message itself, the speaker and the environment). And the second picture shows how ASR tries to eliminate all factors but one: what is said.
With all that said, there is one area where we might be too zealous, and that’s the detection of “(meaning carrying) intonation” in the picture above. These are the supra-segmental features of an utterance, driven by what the speaker is trying to convey or express beyond the pure words. Take the intonation pattern, which can turn the same string of words into either a statement or a question. In writing we use “.” and “?” to represent this, but in ASR intonation is abstracted away along with all the other parameters, so that “You’re OK.” and “You’re OK?” end up as the same string of phonemes. As these systems become more natural, we may need to bring this level of intonation back into our analysis.
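As a toy illustration (not a description of any production system), a rising pitch contour at the end of an utterance often signals a question in English, while a falling one signals a statement. Given f0 estimates over the final word, a crude classifier only needs to compare the start and end of the tail; the f0 values below are invented for the example.

```python
def final_contour(f0_tail):
    """Classify the final f0 contour (in Hz) as 'rising' or 'falling'."""
    slope = f0_tail[-1] - f0_tail[0]
    return "rising" if slope > 0 else "falling"

# Hypothetical f0 tracks over the last word of "You're OK":
statement_f0 = [180, 160, 140, 120]   # falls  -> "You're OK."
question_f0  = [140, 160, 190, 220]   # rises  -> "You're OK?"

print(final_contour(statement_f0))  # falling
print(final_contour(question_f0))   # rising
```

A real system would of course use a robust f0 tracker and a trained model rather than a single slope, but this is the kind of supra-segmental information that a phoneme string discards.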
Pauses are another supra-segmental cue, and we are actively studying how to bring them back into the scope of machine learning. Simon Boutin, a Master’s student at ETS, Montreal and an intern here at Nuance, presented a poster at Interspeech in Dresden a couple of months ago. He has been looking into how to detect direct and indirect quotation in an utterance. Imagine the ASR output for an utterance by a user of an intelligent assistant is “Mary have a nice day.” Did the user actually say:
a) Text Mary: “Have a nice day”, or
b) Text “Mary, have a nice day” (in which case the system would ask: “To whom do you want to send the text?”)?
The difference in meaning matters; however, the ASR output doesn’t contain any punctuation. The task at hand is therefore to find out how to get the punctuation into the string – in other words, to add some of the supra-segmental “intonation” variation back into scope. In addition to direct quotation, the user could also have used indirect quotation; see the table below for a comparison.
Simon found that there are in fact cues in the way the user pronounces the utterance that allow us to put the quotation marks in the right places: one is pitch (measured as f0), and another, even better cue is the length of the pauses between words.
What he found (among other things) was that pauses before direct quotation are longer than those before indirect quotation, and longer still than those before normal text. Of course, for ASR to make use of this it needs to take pause length into account rather than abstract it away.
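A toy sketch of how that cue could be used downstream. The ordering of the thresholds (direct > indirect > plain text) reflects the finding above, but the threshold values themselves are invented for the example – the actual durations and classifier are a matter of the study, not shown here.

```python
# Pause-length thresholds in seconds, checked longest-first.
# These numbers are illustrative, not measured values.
PAUSE_THRESHOLDS = [
    (0.50, "direct quotation"),
    (0.25, "indirect quotation"),
    (0.00, "plain text"),
]

def classify_pause(pause_sec):
    """Map the pause length before a phrase to a likely quotation type."""
    for min_pause, label in PAUSE_THRESHOLDS:
        if pause_sec >= min_pause:
            return label

# "Text Mary <pause> have a nice day" – a long pause suggests a
# direct quotation begins after "Mary":
print(classify_pause(0.6))   # direct quotation
```

In practice the pause duration would be one feature among several (f0 being another), feeding a trained classifier rather than fixed thresholds.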
Pauses also played a role in a recent new version of our TTS (Text-to-Speech) system – notably for Korean. But rather than removing variation to reach a canonical form, in TTS we start from written text and need to add variation to produce natural-sounding output.
As TTS uses a voice talent as the basis of its models, in concatenative TTS systems some of the individual variation (looking at our three axes above again) comes “for free.” (Although in parametric TTS, and when reusing pre-recorded elements of concatenative TTS in a context different from the one they were recorded in, getting the individual pattern right is still a challenge.)
Variation of context is again not a problem (the channel used to output a synthesized utterance will take care of adding noise), so that leaves the problem of getting the right intonation or expression into what is said. We must analyze the text we are supposed to turn into speech, using NLP methods, and come up with features that inform choices during synthesis, such as when to pause and for how long. In Korean specifically, pauses are very important: to make speech easy to understand, pauses need to occur in the right places (e.g. between phrases, never within phrases).
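The phrase-boundary rule can be sketched as a tiny front-end step. This is an illustrative simplification, not our TTS pipeline: it assumes the NLP analysis has already split the text into phrases (here they arrive pre-split), and it emits a pause marker only at phrase boundaries, never inside a phrase. The marker format and the 300 ms default are made up for the example.

```python
def insert_pauses(phrases, pause_ms=300):
    """Interleave pause markers between phrases for the synthesizer."""
    tokens = []
    for i, phrase in enumerate(phrases):
        tokens.extend(phrase.split())  # words within a phrase: no pause
        if i < len(phrases) - 1:
            tokens.append(f"<pause {pause_ms}ms>")  # pause at the boundary
    return tokens

print(insert_pauses(["have a nice day", "see you tomorrow"]))
```

The hard part in a real system is upstream of this function: deciding where the phrase boundaries are, and how long each pause should be, from the text alone.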
What my colleagues found is that they could improve the perceived quality of a voice (MOS score) dramatically by making sure pauses occur in the right places – see below.
Correct pausing is critical for intelligibility and meaning – and this applies to many languages, not just Korean. The duration of a pause also matters and can cue different turns in a discourse. There is some evidence that, for a given speaker, pause durations are quantal in nature, falling into discrete lengths that help pace the overall utterance.
Finally, pausing doesn’t imply silence; breathing noises play an important role in helping a listener interpret the speech. Other prosodic cues in the sounds surrounding a pause are also hugely significant – for instance, in distinguishing a question from a declarative, or in signaling the end of a meaningful segment but not of the whole utterance.
Variation and pausing are the essence of our language in many ways, helping us understand emotion, intention, and when a response is expected. In ASR, TTS, and the applications that use them, integrating that variation creates a more natural, humanlike conversation between human and machine – and it is among the many fascinating ways that speech technology and machine learning are changing how we engage with our digital world.