Dragon, do you speak my dialect?

Dialects are a part of who we are, and can represent home to many of us in an increasingly global world. Many wonder, though: how do we train speech recognition systems to understand such unique regional languages?
By Nils Lenke
Nuance speech technology can understand over 80 languages and their dialects

In a couple of recent blog posts (here and here) we looked into variation between individual speakers and several factors contributing to it, but an important aspect of human language worth diving deeper into is dialect.

So what is a dialect? A tricky question to answer, and one that can get you into political trouble in some areas of the world! In the past, central authorities were often skeptical of communities that claimed to have a (regional) language of their own, preferring to speak of a mere “dialect.” Conversely, smaller countries with a big neighbor often insisted they spoke a language of their own, not just a dialect of the neighbor’s language. Luckily, linguistic variation is today often seen as a precious treasure of cultural heritage, but in many places Max Weinreich’s summary that “a language is a dialect with an army and navy” is still valid. Avoiding those issues, I will use “dialect” in a pragmatic way, also encompassing regional languages and accents.

Looking back over the more than 20 years I have spoken to customers and others about Automatic Speech Recognition (ASR), the most frequently asked question has definitely been, “Do your systems speak dialect X?”, where “X” may have been Bavarian, Scottish English, Swiss German, Canadian French – and many other examples.

After many centuries of authorities trying to discourage the use of dialects, today many people are actually proud of their ability to speak one. Recall the welcome feeling when, on a trip to some exotic place, you recognized somebody from your home region just by listening to how they speak. Even governments exploit this today: the German state of Baden-Württemberg (which prides itself on being the birthplace of many inventors and scientists, like Carl Benz, Johannes Kepler, and Albert Einstein, AND is also the home of the Nuance Ulm office) coined the slogan: “We can do everything. Except [speak] High German.” Obviously the slogan is not quite true, in that most speakers of dialects also speak the “standard” form of their language and apply what linguists call “code switching”: depending on the social setting, speakers switch between the standard language (in a formal setting) and dialect (at home or with friends) and back. Dialect can likewise be a tool for signaling to somebody that they are welcome in your home, or that they will remain a stranger because they don’t speak your dialect. The same mechanism may be at work in the numerous radio spots and YouTube videos where people make fun of ASR that supposedly does not speak a dialect; see for example here and here.

The second reason people may doubt that ASR works well with dialect may be related to the long history of dialects not being an acceptable language to use in school (at least in some countries). Dialects clearly deviate from the rules of the standard language as codified in the grammar book, and that somehow encouraged the myth that dialects have no rules, are “irregular,” and are necessarily difficult to capture in a machine. From a linguistic viewpoint, that really is just a myth: granted, dialects sometimes don’t have a written form, but for linguists spoken language is more important anyway, written language being only a secondary derivation. And in their spoken form, dialects are as regular as any other language; they are neither worse nor more difficult, nor better nor easier, than “standard” languages.

Machine Learning, especially Deep Learning based on neural nets, can deal with the variety of having several dialects and a standard form in one population. As long as you make sure all dialects are reflected in your training data (and we do make sure of that; for the UK, for example, we use more than 20 defined dialect regions), the resulting models will reflect all those ways of pronouncing the phonemes (or sounds) of a language. We also make sure to include words that are special to a dialect (again using the UK as an example: different areas refer to a bread roll as a cob, a barm cake, or a bun), and where pronunciation differences go beyond isolated phonemes, we reflect that in the pronunciation dictionary.
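To make that concrete, here is a minimal, purely illustrative sketch of a pronunciation lexicon that stores several dialect variants per word. The phoneme strings, region labels, and data structure are my own invention for this post; they are not Nuance's actual lexicon format.

```python
# A toy pronunciation lexicon: one word can map to several
# dialect-specific phoneme sequences. Phoneme strings and region
# labels are invented for illustration only.
from collections import defaultdict

lexicon = defaultdict(list)

def add_pronunciation(word, phonemes, region):
    """Register one pronunciation variant of a word for a dialect region."""
    lexicon[word].append({"phonemes": phonemes, "region": region})

# The same UK bread roll under three regional names...
add_pronunciation("bun", "b ah n", "uk-general")
add_pronunciation("cob", "k oh b", "uk-midlands")
add_pronunciation("barm cake", "b aa m k ey k", "uk-northwest")

# ...and one word with two accent-dependent pronunciations.
add_pronunciation("bath", "b aa th", "uk-south")  # long vowel
add_pronunciation("bath", "b a th", "uk-north")   # short vowel

def pronunciations(word):
    """All variants the recognizer should accept for this word."""
    return lexicon[word]

print(pronunciations("bath"))
```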

For instance, our UK English language pack recognizes 52 different pronunciations of the word “Heathrow,” so our airline customers can cater to those whose first language isn’t English. When the differences become too big, we create separate models in some cases: users of Dragon speech recognition software can choose between variations of English, and between Flemish (for Belgium) and Dutch (for the Netherlands).

Occasionally this is done “under the hood,” so to speak. Even within the Dragon US English version there are several dialect models. We use a classifier (another application of Machine Learning) to detect which “package” fits the user’s dialect best, and use that one for recognition (if you are interested in a more academic treatment, this PhD thesis is an in-depth study of how to deal with Arabic dialects). We also verify that this works by measuring accuracy gains per variant: Dragon Professional Individual English, for example, improves on the previous version with a 22.5% error reduction for speakers of English with a Hispanic accent, 16.5% for southern (US) dialects, 13.5% for Australian English, 18.8% for UK English, 17.4% for Indian English, and 17.4% for Southeast Asian speakers of English.
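For readers who like to see the mechanics, here is a toy sketch of that routing idea, along with how such gains are computed as relative error reduction. Everything in it (model names, scores, error rates) is a hypothetical placeholder, not one of Nuance's actual components.

```python
# A toy version of the "under the hood" routing: score each dialect
# package against the user's audio, pick the best fit, and report
# improvement as relative error reduction.

def fit_scores(audio_features):
    """Stand-in for a trained classifier; a real system would score the
    audio against each dialect model with, e.g., a neural net."""
    return {
        "en-us-general": 0.61,
        "en-us-southern": 0.83,
        "en-us-hispanic-accent": 0.47,
    }

def pick_package(audio_features):
    """Route the user to the best-fitting dialect package."""
    scores = fit_scores(audio_features)
    return max(scores, key=scores.get)

def relative_error_reduction(old_wer, new_wer):
    """E.g. going from an 8.0% to a 6.2% word error rate is a 22.5%
    relative error reduction, which is how the gains above are stated."""
    return (old_wer - new_wer) / old_wer * 100

print(pick_package(audio_features=None))   # -> en-us-southern
print(relative_error_reduction(8.0, 6.2))  # -> 22.5
```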

Finally, as I mentioned in this blog post, we have adaptation to help us with the challenge: dictation software like Dragon will adapt over time to a user’s specific dialect. When usage deviates from what we anticipated during training, ASR may not work for every dialect every time. However, speech recognition accuracy has risen considerably, to upwards of 99% across a number of languages, as evidenced by the broad, global adoption of our cloud-based ASR and NLU, used by thousands of apps in cars, IoT devices, smartphones, etc.
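To give a rough intuition for what such adaptation could look like, here is a deliberately simplified sketch that nudges per-user pronunciation weights toward what the user is actually heard saying. Real dictation software adapts its acoustic and language models in far more sophisticated ways; this is only meant to convey the idea.

```python
# A simplified flavor of adaptation: per-user weights over a word's
# pronunciation variants drift toward the variant the user produces.

class UserProfile:
    def __init__(self, variants):
        # Start with uniform weights over the known pronunciation variants.
        self.weights = {v: 1.0 / len(variants) for v in variants}

    def observe(self, variant, learning_rate=0.1):
        """The user was heard using this variant; shift weight toward it."""
        for v in self.weights:
            target = 1.0 if v == variant else 0.0
            self.weights[v] += learning_rate * (target - self.weights[v])

# A northern-English speaker dictates "bath" ten times...
profile = UserProfile(["b aa th", "b a th"])
for _ in range(10):
    profile.observe("b a th")

print(profile.weights)  # weight has drifted toward the short vowel
```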

Nuance global speech recognition usage heat map
Have a look at the heat map above, which shows where people are using our technology successfully. This also holds when we drill deeper, for instance into Scotland vs. England.

Linguistic variety is as important to us as it is to you, which is why we support more than 80 languages (including regional languages like Catalan and Basque, which we developed in cooperation with regional governments), and, as we have seen in this blog post, we do a lot more to cover variation and dialects beyond that number. So we welcome the challenge of dialect – even in the form of a YouTube spoof.




About Nils Lenke

Nils joined Nuance in 2003, after holding various roles for Philips Speech Processing for nearly a decade. Nils oversees the coordination of various research initiatives and activities across many of Nuance’s business units. He also organizes Nuance’s internal research conferences and coordinates Nuance’s ties to Academia and other research partners, most notably IBM. Nils attended the Universities of Bonn, Koblenz, Duisburg and Hagen, where he earned an M.A. in Communication Research, a Diploma in Computer Science, a Ph.D. in Computational Linguistics, and an M.Sc. in Environmental Sciences. Nils can speak six languages, including his mother tongue German, and a little Russian and Mandarin. In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.