Deep Neural Nets (DNNs) have taken over Machine Learning these last few years, driving headlines and discussion in the industry and the media. That said, we're just scratching the surface with Neural Nets, which continue to evolve, with many different approaches and challenges still to be solved.
“Standard” DNNs are unidirectional: information flows in just one direction, from the input layer through the hidden layers to the output layer. In Machine Learning lingo, these DNNs are of the “feed-forward” type. They are best when all the information needed to learn is available at the same time. Think of image recognition: the image is available at once and the network can decide what it sees in it in one look, or in this case, in one pass through the network.
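To make the one-direction flow concrete, here is a minimal sketch of a feed-forward pass — a toy network with made-up random weights, not any production model. Each layer sees only the output of the layer before it; nothing flows backwards or loops.

```python
import numpy as np

def feed_forward(x, weights, biases):
    """One left-to-right pass: each layer only sees the previous layer's output."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)  # activation of the next layer
    return h

# Toy network: 3 inputs -> 4 hidden units -> 2 outputs (illustrative random weights)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
y = feed_forward(np.array([1.0, 0.5, -0.2]), weights, biases)
```

The whole input is consumed in one pass — exactly the "one look" situation described above for image recognition.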
The teams here at Nuance are applying DNNs to advance speech recognition and natural language understanding as part of our mission to better facilitate communication between people and technology.
One of the interesting challenges with speech is that, as opposed to vision, it is embedded in time. An utterance unfolds over a number of seconds, and what happened a few minutes or even a few seconds ago helps to understand what is happening right now – better known as context. Technically speaking, of course, if you waited until the end of an utterance, you could present the whole utterance to a DNN at once, and a feed-forward network could access all the information it needs to do the recognition job in one go. The problem is that for dialog systems, like personal assistants, you cannot do that. Because speech recognition is a heavy-duty computing job, engines start working right after an utterance begins and try to keep up with the speaker, so as to quickly offer a response once the speaker is done talking, just like in a conversation between people.
As a result, the speech recognition engine looks at one slice of speech at a time. And to memorize the context, we at Nuance use a special variant of DNNs, the so-called Recurrent Neural Nets (RNNs).
Their neurons take input not only from the left (as shown above, left-hand side), but also have access to their own *previous* state (or, in variants, even that of other neurons; see above, right-hand side). These feedback loops form a kind of memory.
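The feedback loop can be sketched in a few lines — a toy recurrent step with illustrative random weights, not a real speech model. The key point is that the new hidden state depends on the current input *and* the previous state, so information from earlier time steps persists:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent step: the new state depends on the input AND the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(1)
W_xh = rng.standard_normal((3, 2))   # input -> hidden
W_hh = rng.standard_normal((3, 3))   # hidden -> hidden: the feedback loop
b = np.zeros(3)

h = np.zeros(3)                      # initial "empty memory"
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h = rnn_step(x_t, h, W_xh, W_hh, b)   # h carries context forward in time
```

Without `W_hh @ h_prev`, this would collapse back into a plain feed-forward layer with no memory.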
Let’s look at Language Modelling to illustrate that: Language Models (LMs) predict the next word based on the last so-many words (where ideally we would not have to fix the number of context words in advance – it should be variable). For example, if you have already heard “God save the”, then “Queen” is a much more likely continuation than most other words. What we have found is that LMs based on RNNs work significantly better than traditional LMs.
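To see what an LM does in miniature, here is a toy count-based trigram model over a made-up corpus (a traditional fixed-window LM, not an RNN — an RNN LM replaces the fixed two-word history with its hidden state, which is exactly what removes the fixed context size):

```python
from collections import Counter, defaultdict

# Toy corpus; a real LM would be trained on vastly more text.
corpus = ("god save the queen . god save the queen . "
          "god save the day . pass the salt .").split()

# Count trigrams: estimate P(next | two previous words).
counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1

def predict_next(w1, w2):
    """Most likely continuation after the two-word history (w1, w2)."""
    return counts[(w1, w2)].most_common(1)[0][0]

best = predict_next("save", "the")   # "queen" dominates this toy corpus
```

The hard limit of this model is visible in the key `(w1, w2)`: the history is frozen at two words. An RNN LM has no such cutoff.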
Now let’s look at Natural Language Understanding (NLU), or mapping the recognized words via speech recognition to meaning.
One sub-task is to identify “named entities”. For example, in an enquiry like “Is there free capacity at the parking garage next to Boston South Station?”, the phrases “parking garage” and “Boston South Station” are such named entities. So in a first step we want to label each word of the utterance as belonging to such a named-entity expression or not. A decade or two ago such a task would have been handled by HMMs (Hidden Markov Models), the old workhorse of Machine Learning, also used in speech recognition before DNNs. But since then, another mathematical model has taken over that is especially good at such labeling or tagging tasks (mapping a sequence of items, in our case words, to a sequence of labels).
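The labeling target is often written in the so-called BIO scheme (Begin/Inside/Outside). Here is a sketch of what the labels for the example query would look like — the entity type names (`FACILITY`, `LOCATION`) are illustrative choices, not a fixed standard, and a real tagger would of course *predict* these labels rather than have them written down:

```python
tokens = ["Is", "there", "free", "capacity", "at", "the", "parking", "garage",
          "next", "to", "Boston", "South", "Station", "?"]
labels = ["O", "O", "O", "O", "O", "O", "B-FACILITY", "I-FACILITY",
          "O", "O", "B-LOCATION", "I-LOCATION", "I-LOCATION", "O"]

def extract_entities(tokens, labels):
    """Collect the token spans marked as named entities."""
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity begins
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif lab.startswith("I-") and current:
            current.append(tok)           # entity continues
        else:                             # outside any entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

ents = extract_entities(tokens, labels)   # ["parking garage", "Boston South Station"]
```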
This model is the CRF, or “Conditional Random Field”. In contrast to the previous task we looked at (speech recognition), for NLU we can afford to wait for the entire utterance to be available: the benefit of being able to look at all the words at the same time outweighs the small delay caused by the NLU processing step, which is very fast compared to ASR. CRFs easily outperform HMMs on tasks like NER (Named Entity Recognition).
They have one soft spot, however: it takes a bit of manual work to tell them what to watch out for in the input data (the sequence of words). Should they look at just the words at face value, or also at their grammatical type? At the neighbors left and right, and how far out? But this so-called feature selection is something that neural nets are good at: they learn on their own which features are the most valuable.
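To show what that manual work looks like, here is the kind of hand-crafted feature template a CRF is typically fed (the specific feature names are illustrative, in the style of common CRF toolkits — every line is a decision an engineer had to make):

```python
def word_features(tokens, i):
    """Hand-crafted feature template: the engineer decides what the CRF sees,
    and how far to look left and right."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization often signals a name
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["next", "to", "Boston", "South", "Station"]
f = word_features(tokens, 2)   # features for "Boston"
```

A neural net, in contrast, induces this kind of information from the raw words itself — which is exactly the opening for the combination described next.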
So why not combine CRFs with NNs? That is exactly what our NLU team at Nuance has done.
In this model – “NeuroCRFs” – the NNs do the feature induction part, and the CRFs do the “rest”. We found that RNNs, with their built-in memory function, work especially well in combination with CRFs. This is because they can “remember” a context of variable length, whereas other NNs would force us to arbitrarily define a context window size. Together with some clever tricks and optimizations, the resulting models can outperform an already good CRF baseline by more than 10 percent in accuracy. (Two of my colleagues described this in (much) more detail at ASRU 2015 in December: Marc-Antoine Rondeau and Yi Su, “Recent Improvements to NeuroCRFs for Named Entity Recognition,” in Proc. of ASRU 2015, pp. 390–396.)
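The division of labor can be sketched numerically. In the sketch below — a simplified illustration of the general neural-CRF idea, with made-up numbers rather than learned parameters — the network contributes per-word label scores (“emissions”) while the CRF contributes label-to-label transition scores, and a whole label sequence is scored as the sum of both:

```python
import numpy as np

labels = ["O", "B-LOC", "I-LOC"]

# Emissions: rows = words, cols = labels; in a NeuroCRF this is the (R)NN's output.
emissions = np.array([[2.0, 0.1, 0.0],    # word 1: probably O
                      [0.2, 1.5, 0.1],    # word 2: probably B-LOC
                      [0.1, 0.3, 1.8]])   # word 3: probably I-LOC

# Transitions: score of moving from label i (row) to label j (column).
transitions = np.array([[0.5, 0.2, -2.0],   # O -> I-LOC is penalized (no entity open)
                        [0.1, -1.0, 1.0],   # B-LOC -> I-LOC is encouraged
                        [0.2, 0.1, 0.8]])

def sequence_score(seq):
    """Total score of one label sequence: emissions plus transitions."""
    s = emissions[0, seq[0]]
    for t in range(1, len(seq)):
        s += transitions[seq[t - 1], seq[t]] + emissions[t, seq[t]]
    return s

good = sequence_score([0, 1, 2])   # O, B-LOC, I-LOC
bad = sequence_score([0, 0, 2])    # O, O, I-LOC: an ill-formed jump into an entity
```

The transition matrix is what lets the CRF enforce sequence-level consistency that per-word scores alone cannot, while the network saves the manual feature engineering.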
The takeaway
First, while it is true that Machine Learning, and especially DNNs, are good at many tasks, it doesn’t mean that the same exact type of DNN is the best answer in all cases. For that reason, a lot of hard research work goes into finding the best “net” to catch each proverbial fish – or, in this case, each task.
Second, as an end user you have no way of telling which technology you are talking to. If you have called various ASR- and NLU-powered systems over time, you may have spoken to different generations of technology. But the only difference you would have noticed is that the systems kept becoming more accurate and more powerful. And in the era of Machine Learning, with ever more data being turned into “big knowledge”, it will not stop there.