Why we’re using Deep Learning for our Dragon speech recognition engine

By Nils Lenke
Dragon uses deep learning for more accurate speech recognition.

Everybody is special in how they use language – how they speak, the words they use, and so on. In an earlier blog post, we saw how speech recognition systems cope with this variation by training on speech and language data that cover many accents, age groups, and other differences in speaking style. This creates very robust systems that work well for (nearly) every speaker; we call this “speaker-independent” speech recognition.

But in some cases, the individuality of the speaker matters and can be leveraged to create even better experiences – like our latest Dragon Individual and Dragon Legal offerings, which are typically used by a single user. This allows us to go beyond speaker-independent speech recognition by adapting to each user in a speaker-dependent way. Dragon does this on several levels:

  • It adapts its active vocabulary to the user by inspecting texts the user has created in the past, both adding custom words and learning the typical phrases and text patterns the user employs (see the sketch after this list).
  • During each session, it does a fast adaptation of its acoustic model (which captures how words are pronounced) based on just a few seconds of speech from the user. By doing this, it can also adapt to how a user’s voice sounds in the moment – for instance, whether they are affected by a cold, using a different microphone, or dictating in a different environment.
  • During the optional enrollment step, or later after a dictation session ends, Dragon does some more intensive learning in an offline mode, continuing to adapt its models to the specific user’s speaking patterns over time.
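
To make the first of these levels concrete, here is a minimal sketch – my own illustration, not Dragon’s actual implementation – of mining a user’s past texts for custom words and frequent phrases; the base vocabulary and sample texts are hypothetical:

    from collections import Counter

    # Hypothetical starter vocabulary shipped with the recognizer (tiny on purpose).
    base_vocabulary = {"the", "claimant", "seeks", "from", "disputes", "claim"}

    def adapt_vocabulary(user_texts, base_vocab):
        """Collect out-of-vocabulary words and frequent word pairs from past texts."""
        custom_words, bigrams = set(), Counter()
        for text in user_texts:
            words = [w.lower().strip(".,") for w in text.split()]
            custom_words |= set(words) - base_vocab   # candidate custom words
            bigrams.update(zip(words, words[1:]))     # the user's typical phrasing
        return custom_words, bigrams.most_common(3)

    texts = ["The claimant seeks indemnification from Acme GmbH",
             "Acme GmbH disputes the indemnification claim"]
    print(adapt_vocabulary(texts, base_vocabulary))

A real system would weigh such evidence against a much larger language model, but the principle is the same: the user’s own documents tell us which words and phrases to expect.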

This latter point deserves more attention. Dragon uses Deep Neural Networks end-to-end, both at the level of the language model, which captures how frequent words are and in which combinations they typically occur, and at the level of the acoustic model, which deciphers the smallest spoken units of a language, its phonemes.
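
As a toy illustration of the language-model side – a generic PyTorch sketch, not Nuance’s actual architecture – a small network can learn to predict which word is likely to come next:

    import torch
    import torch.nn as nn

    # Toy next-word language model: embeddings feed a recurrent layer whose
    # states are projected onto vocabulary-sized logits for the next word.
    class TinyLM(nn.Module):
        def __init__(self, vocab_size, dim=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, token_ids):              # (batch, seq_len)
            hidden, _ = self.rnn(self.embed(token_ids))
            return self.out(hidden)                # next-word logits

    model = TinyLM(vocab_size=1000)
    tokens = torch.randint(0, 1000, (4, 12))       # dummy batch of 4 sequences
    print(model(tokens).shape)                     # torch.Size([4, 12, 1000])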

These models are quite large, and before they leave our labs they have already been trained on lots and lots of data. One of the reasons why Neural Networks have taken off only now, and not in the late 20th century when they were invented, is that training is a very computing-intensive process. We use significant numbers of GPUs (Graphics Processing Units) to train our models. GPUs were originally invented for computer graphics applications like video games. Computing images and training Deep Neural Networks have a lot in common: both tasks apply relatively simple calculations to lots of data points at the same time, and this is exactly what GPUs are good at. We use multiple GPUs in parallel in one training session to speed up the training process.
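
The pattern is easy to sketch. Below is a minimal PyTorch example of data-parallel training, in which each GPU applies the same simple calculations to its own slice of the batch; the model and data are placeholders, not our acoustic model:

    import torch
    import torch.nn as nn

    # Placeholder network and synthetic batch; real acoustic models are far larger.
    model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 40))

    # DataParallel splits each batch across all visible GPUs, runs the same
    # forward/backward computation on each slice, and gathers the results.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    features = torch.randn(128, 40, device=device)   # dummy acoustic features
    targets = torch.randn(128, 40, device=device)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features), targets)
    loss.backward()
    optimizer.step()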

But how do we apply this outside of our data centers? Adapting the Deep Neural Networks that make up the acoustic model to the speech coming from the user is similar to training them, and we want that to happen on the user’s PC, Mac or laptop – and we want it to be fast. It is a demanding task: we need to make sure adaptation works with just a little data, and that it is a computationally very efficient process.
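
One generic way to keep adaptation that cheap – a common technique, and not necessarily what Dragon does internally – is to freeze most of a pretrained network and update only a small part of it on the user’s few seconds of speech:

    import torch
    import torch.nn as nn

    # Stand-in for the pretrained acoustic model shipped with the product.
    pretrained = nn.Sequential(
        nn.Linear(40, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 40),              # only this output layer will be adapted
    )

    # Freeze everything except the last layer: few parameters to update means
    # little computation, and less risk of overfitting to seconds of audio.
    for p in pretrained.parameters():
        p.requires_grad = False
    for p in pretrained[-1].parameters():
        p.requires_grad = True

    optimizer = torch.optim.SGD(
        [p for p in pretrained.parameters() if p.requires_grad], lr=0.001)

    user_features = torch.randn(32, 40)  # a few seconds of speech (dummy data)
    user_targets = torch.randn(32, 40)   # placeholder supervision signal

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(pretrained(user_features), user_targets)
    loss.backward()
    optimizer.step()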

Packaging this process in a way that allows the individual to run it on their desktop or laptop is the culmination of many years of innovation in speech recognition and machine learning R&D. Enjoy the result of a highly accurate Dragon experience that is fully personalized to you and your voice.

Deep learning powers new Dragon suite

A new suite of Dragon professional productivity solutions, powered by Nuance Deep Learning technology, drives documentation productivity with higher accuracy, speed and efficiency.



  • undisclosed location

    My main concern is that Vocola/NatPython will continue to work with the new version of NaturallySpeaking. I’m sure you’re very proud of the NaturallySpeaking macro language, but it just doesn’t work well enough for anything beyond the most simple speech macros.

    My observations about the NaturallySpeaking macro language start with the fact that it’s virtually impossible to write macros using speech recognition, whereas I can create extensions using Vocola/NatPython with speech recognition without much difficulty.

  • Christian Bayerlein

    Overall, that sounds cool. I hope that will improve recognition speed, too, especially in command and spelling mode. Coming from DNS 10, I am very disappointed by later versions. I experience that especially when I dictate a long string of letters in “command and spelling” mode – which I do a lot, e.g. for programming (I’m a web developer) or typing English texts (I use Dragon in German). With DNS 10, I always get an immediate response to the dictation, no matter how many letters I dictate in one go: as soon as I stop dictating, the text is on the screen. With later versions, I have to wait until the recognition is finished, which can sometimes take up to several seconds.

    I also hope that the compatibility issues with Win10 Modern UI apps like the Settings panel and MS Edge have been solved. Up to now, for example, the mouse grid is not working correctly in these apps.

  • Kent

    I am from Liberia, and I fight with NLPs all the time. Will this be my new cyber friend? I would definitely like to test it.

  • http://www.dcneuro.net/ Dr. A. R. Scopelliti

    Is there a reason that the font needs to be so small that you can’t actually see it in DragonPad and the correction box? Why isn’t there a control so that the user can set the font size? I have used every version of Dragon from 1 through 13 on the Windows side, and 1 through 6 on the Mac side. The Mac version pales by comparison: many, many grammatical errors, and the constant trailing letter appearing when mouse use is integrated with voice use. And the fonts? WTF? Why would you intentionally make fonts that no one can see? I’ve been asking for this to be rectified since version 4.

Nils Lenke

About Nils Lenke

Nils joined Nuance in 2003, after holding various roles for Philips Speech Processing for nearly a decade. Nils oversees the coordination of various research initiatives and activities across many of Nuance’s business units. He also organizes Nuance’s internal research conferences and coordinates Nuance’s ties to academia and other research partners, most notably IBM. Nils attended the Universities of Bonn, Koblenz, Duisburg and Hagen, where he earned an M.A. in Communication Research, a Diploma in Computer Science, a Ph.D. in Computational Linguistics, and an M.Sc. in Environmental Sciences. Nils can speak six languages, including his mother tongue German, and a little Russian and Mandarin. In his spare time, Nils enjoys hiking, and hunting in archives for documents that shed some light on the history of science in the early modern period.