Back in the 1990s, when you told someone that your company was working on speech recognition for jet pilots, they would inevitably say, “Wow, that must be difficult, because of all the noise.” And you would say, “Well, yes and no.” Yes, there is noise, but it is very predictable noise (caused by the engine and the wind). This “stationary” noise can be filtered out quite reliably. Plus, the microphone is always in the same place and positioned very close to the pilot (e.g., fixed in their oxygen mask). So it actually was simpler than it sounded.
But the reverse can also be true: something seemingly simple turns out to be more difficult in practice. People who want speech recognition to automatically transcribe what is said during a meeting (because nobody wants to be the scribe!) assume it’s an easy task. It really can’t be that hard to capture a meeting, right? There’s more to it than you think.
A number of variables come into play. First, we have to identify who is talking and where they are located. Conference rooms often feature more than one microphone, and the potential speakers may be scattered around them. Some speakers will be quite distant from the closest microphone, which also makes reverberation (a kind of echo from the room walls) a problem. So, initially we will not know who is speaking or where they are located with respect to the microphones. To account for this, the system focuses its attention on the active speaker only, working to filter out background noise and the echo effects mentioned above. As humans, we do that all the time, without thinking much about it. You may have heard this referred to as the cocktail party effect. In an environment where more than one microphone is available, we can mimic this capability by applying beamforming technology, which we also use in the car and in home environments.
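To make the beamforming idea a little more concrete, here is a minimal sketch of delay-and-sum, one of the simplest beamforming schemes. It is not a description of our production system. It assumes the arrival delays of the target speaker at each microphone are already known and are whole numbers of samples. The function name `delay_and_sum`, the helper `tone`, and all signal values are hypothetical and chosen for illustration only.

```python
import math


def delay_and_sum(channels, delays):
    """Undo each microphone's arrival delay for the target, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   samples by which the target arrives late at each microphone
              (assumed known here; a real system has to estimate them).
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # read ahead by the delay to re-align the target
            if 0 <= src < n:
                out[t] += samples[src]
    return [value / len(channels) for value in out]


def tone(period, t):
    """A pure sine tone standing in for a voice (illustrative assumption)."""
    return math.sin(2 * math.pi * t / period)


N = 64
TARGET_DELAYS = [0, 3]  # the target reaches the second microphone 3 samples late

# Each microphone hears the target (period 16) plus an interfering talker
# (period 8) who sits elsewhere and therefore arrives with delays of 4 and 3.
mic0 = [tone(16, t) + tone(8, t - 4) for t in range(N)]
mic1 = [tone(16, t - 3) + tone(8, t - 3) for t in range(N)]

target = [tone(16, t) for t in range(N)]
steered = delay_and_sum([mic0, mic1], TARGET_DELAYS)

# Ignore the last few samples, where the shifted channel has run out of data.
valid = range(N - max(TARGET_DELAYS))
error_single = max(abs(mic0[t] - target[t]) for t in valid)
error_steered = max(abs(steered[t] - target[t]) for t in valid)
print(f"single microphone error: {error_single:.3f}")
print(f"beamformed error:        {error_steered:.3f}")
```

The interferer in this toy setup was picked so that it cancels completely once the channels are aligned on the target. In a real room, with reverberation and broadband speech, the competing sound is only attenuated, and the delays themselves have to be estimated and updated as the active speaker changes.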
Related is the task of distinguishing between multiple speakers, because they will alternate over time (which means you need to continually adapt your beamforming). Speaker diarization, that is, sorting speech into per-speaker buckets, is how we do this. One helpful trick is to make use of voice biometric technology. While its main use case is to authenticate a speaker, you can also use it to identify a known speaker in a group. Once you have succeeded with diarization, you can also use the speech of each individual speaker to adapt the speech recognition models to better reflect their characteristics, similar to how we do it for our Dragon dictation software.
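The voice biometrics trick can be sketched roughly as follows. This is an illustration under simplifying assumptions, not our actual pipeline. It assumes each stretch of audio has already been turned into a fixed-length "voiceprint" vector by some speaker model, and that known participants have enrolled reference voiceprints. The three-dimensional vectors, the names `identify` and `diarize`, and the 0.7 threshold are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(voiceprint, enrolled, threshold=0.7):
    """Return the best-matching enrolled speaker, or "unknown" if none is close."""
    best_name, best_score = "unknown", threshold
    for name, reference in enrolled.items():
        score = cosine(voiceprint, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def diarize(segments, enrolled):
    """Label time-ordered (start, end, voiceprint) segments and merge neighbours.

    Returns a list of (speaker, start, end) turns.
    """
    turns = []
    for start, end, voiceprint in segments:
        who = identify(voiceprint, enrolled)
        if turns and turns[-1][0] == who:
            # Same speaker keeps talking: extend the current turn.
            turns[-1] = (who, turns[-1][1], end)
        else:
            turns.append((who, start, end))
    return turns


# Hypothetical enrolled voiceprints and meeting segments (times in seconds).
enrolled = {"alice": [1.0, 0.1, 0.0], "bob": [0.0, 0.2, 1.0]}
segments = [
    (0.0, 2.0, [0.9, 0.2, 0.1]),
    (2.0, 3.5, [0.8, 0.0, 0.1]),
    (3.5, 6.0, [0.1, 0.1, 0.9]),
    (6.0, 7.0, [0.0, 1.0, 0.0]),  # matches nobody who enrolled
]

turns = diarize(segments, enrolled)
for speaker, start, end in turns:
    print(f"{start:4.1f}-{end:4.1f}s  {speaker}")
```

The resulting per-speaker turns are the "buckets" mentioned above. Segments that match no enrolled voice are kept as "unknown" rather than forced onto the nearest speaker, which matters in meetings where guests have not enrolled.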
Of course, there may even be times when multiple participants speak at the same time. True, humans typically employ an elaborate ‘turn taking’ system to predict when it is a good time to take over the role of speaker, but as we all know, that doesn’t always work – more often than not, multiple people will speak at the same time during a meeting. This crosstalk is the next challenge we face, and again, exploiting multiple microphones will help.
Now that we know who is speaking and when (and how), we can start with the actual task: applying speech recognition. This brings about another variable. Often, we will not have previous knowledge of the meeting topic, so the vocabulary has to be very large, and it will be difficult to predict from context what will come next. Recent progress in language modeling seeks to do exactly that – predict words based on context – by using deep neural networks.
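To show what "predicting words based on context" means in the simplest possible terms, here is a toy sketch. It deliberately swaps in a count-based bigram model, which looks only at the previous word, for the deep neural networks mentioned above, which can use much longer context. The tiny corpus and the function names `train_bigrams` and `predict_next` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def predict_next(counts, word, k=3):
    """Return up to k likely next words with their estimated probabilities."""
    following = counts.get(word.lower())
    if not following:
        return []  # the word was never seen with a successor
    total = sum(following.values())
    return [(w, c / total) for w, c in following.most_common(k)]


# A made-up miniature corpus of meeting-related phrases.
corpus = [
    "please schedule the meeting",
    "schedule the review",
    "the meeting starts at noon",
    "cancel the meeting",
]

counts = train_bigrams(corpus)
print(predict_next(counts, "the"))
```

With an unknown meeting topic, a model like this quickly runs into words and word pairs it has never seen. That sparsity is one reason neural language models, which generalize across similar contexts, are attractive for this task.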
With these tools in hand, my colleagues working on capturing and transcribing so-called “ambient speech” have recently reported that they are now beating published results on publicly available test sets by a margin. And beyond the lab, we have actually released the Nuance Transcription Engine (NTE). NTE is primarily targeted at a related use case, transcribing the conversations between call center agents and customers for actionable insights, but it can be used in a wide range of environments for capturing multi-speaker conversations as well.
Even though it’s not as straightforward a task as you may have thought, by combining several different technologies in the right way, we are able to transcribe meetings with successful results. The office of the future may have just found its new scribe.