As movies like “Ex Machina,” “Her,” “The Imitation Game,” and others continue to hit the big screen, we are also seeing a lot of excitement around “deep learning.” Just for fun, I entered “applies deep learning to” into a well-known search engine, and according to the hundreds of results, “deep learning” is being applied to: “satellite images to gain business insights,” “differentiate disease state in data collected in naturalistic settings,” “the task of understanding movie reviews,” “emotion prediction via physiological sensor data,” “Natural Language,” and – probably my favorite – “the tangled confusion of human affairs” (I guess I am not the only one who would claim that the two phenomena are related). Deep learning can seemingly be applied to all of these different areas, and yet we must first ask: where did this “deep learning” idea come from, and what does it mean? To start, I think it has to do with metaphors.
Today, we will take a deep dive and see how metaphors can be powerful tools to guide our minds into new insights – but also to lure them into fresh misconceptions.
Metaphors are everywhere; I easily crammed at least seven into the last sentence (marked in italics). To recap, a metaphor applies words and concepts belonging to one field in order to talk about a quite different field. Take “Elvis is the King of Rock and Roll.” Strictly speaking, Rock and Roll is no kingdom, but by applying the word “king” to it we mentally form it into one: with different ranks of characters and huge masses of underlings hailing their betters (i.e. the fans). Instead, one could say “most dominant artist in” or invent a new word with that meaning. But the first alternative is clumsy, and the second leaves us with an overabundance of words – we would face the same quandary again when trying to find a new word for “King of Pop” to describe Michael Jackson. Evidently, metaphors are not a decorative element of elaborate speech; they are an economical instrument that saves words and effort by recycling old words in new contexts, complete with all the associations they bring with them.
Part I: The “learning” in “deep learning”
According to most dictionary definitions, “learning” – defined by Merriam-Webster as “to gain knowledge or skill by studying, practicing, being taught, or experiencing something” – is something that humans do. So when attaching the word “learning” to things such as animals, substances, or even device systems in our Internet of Things world, this is already metaphorical, as it applies a human concept, involving consciousness – something that these other things don’t have. You may have heard about so-called shape memory alloys. Things made from these metals have an interesting feature: when you bend them from their current form into a new one, and then heat them up, they will revert to their original form. It’s tempting (and also helps with conceptualizing) to describe this behavior in metaphorical terms. Wikipedia employs this method to describe shape memory alloys, conveniently marking its use of metaphors with quotation marks:
Training implies that a shape memory can “learn” to behave in a certain way. Under normal circumstances, a shape-memory alloy “remembers” its low-temperature shape, but upon heating to recover the high-temperature shape, immediately “forgets” the low-temperature shape.
I don’t think anybody would take any of this at face value and assume the metal atoms have little brains, “learning” and “remembering” something. But how about computer programs that do “Machine Learning”? Is the “learning” they do also purely metaphorical? Or are they complex enough to display true learning, like humans do? And why don’t we reject this latter idea right away, as we do for the memory alloy?
One reason, of course, is that computers are more complex, and many people don’t understand them well. Computers were spoken about in metaphorical terms right from the start, referred to by the media as “electronic brains” from the 1950s onwards. Then Science Fiction took over and, not being tied to “technical feasibility” and other boring details, presented us with an abundance of “thinking” machines and robots that subsequently took hold in popular culture. There, they met concepts of “artificial life” rooted in Western culture, from the golem made from clay, through the homunculus of the medieval alchemists, all the way to Mary Shelley’s Frankenstein.
In order to decide whether “Machine Learning” (ML) really learns or just “learns,” here’s a quick primer on the subject. Let’s start with a mathematical model that formed the backbone of many ML systems for many years: Hidden Markov Models, or HMMs. Looking at the image to the left, they seem to be of modest complexity, made up of hidden states (x) and possible transitions (a) between them, which come with certain probabilities, plus mappings (b) to observed outputs (y). Probabilities are “learned” by the model in “training” on many samples of whatever the models are supposed to represent, for instance words or their acoustic building blocks, called phonemes. I think we can agree that “learning” is metaphorical here, as these models aren’t really that different from the atoms of the memory alloy above. However, a few years ago in mainstream ML, HMMs were replaced by a different model type, one that first gained popularity in the 1990s, nearly disappeared after a while, and is now making a forceful return to the stage (we’ll discuss why a little later). Problems start with the name of the model: “Neural Network” (NN). As you can see in the figure to the right, it consists of layers of nodes, and these nodes are supposed to be “inspired” by neurons, like those we find in a brain: they have input coming in through the arrows on the left, similar to how neurons get (electrical) input through their dendrites; then some calculation happens, and the resulting output leaves to the right (and becomes input for the next layer), much as through the axon of a neuron. The calculation in the body of the “neuron” is typically rather trivial, like taking the maximum of the inputs, or a weighted sum of them. In order to use this for an ML task – for example, image recognition – you assign each input node to a pixel of a, say, black-and-white picture, and each output node to a category of object you want it to be able to recognize (“tree,” “cow,” etc.).
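To make the HMM half of this primer concrete, here is a tiny sketch in Python. Everything in it is invented for illustration – two hidden states, two possible observations, and hand-set probabilities standing in for what “training” would normally estimate; real speech models are vastly larger:

```python
# Toy HMM: transition probabilities a[i][j] (chance of moving from hidden
# state i to hidden state j), emission probabilities b[i][k] (chance of
# state i producing observation k), and initial state probabilities pi.
a = [[0.7, 0.3],
     [0.4, 0.6]]
b = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [0.5, 0.5]

def forward(observations):
    """Likelihood of an observation sequence (the classic forward algorithm)."""
    # alpha[i]: probability of being in state i having seen the sequence so far
    alpha = [pi[i] * b[i][observations[0]] for i in range(2)]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(2)) * b[j][obs]
                 for j in range(2)]
    return sum(alpha)

print(forward([0, 1, 0]))  # how likely this model finds the sequence 0, 1, 0
```

In a speech recognizer, such a model might stand for one phoneme, and “training” consists of nothing more mysterious than estimating the numbers in `a` and `b` from many audio samples.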
Then, as with HMMs, you train the model with pictures and known correct results (this picture shows a cow) by setting input and output nodes to the appropriate values. Then you apply a method called “backpropagation” that calculates from right to left (at runtime, your model works left to right, from input to output), adjusting the weights attached to the arcs, so that if input of this nature occurred again, it would trigger the correct output when calculating left to right.
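The whole loop fits in a few dozen lines of plain Python. This is a minimal sketch, not how production systems are built: one hidden layer with sigmoid “neurons,” and the classic XOR toy problem standing in for “pictures and known correct results.”

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

H = 3  # number of hidden nodes
# weights on the "arcs": input (2 nodes) -> hidden (H nodes) -> output (1 node)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    """Runtime direction: left to right, from input to output."""
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(H)) + b2)
    return h, y

# training samples with known correct results (here: XOR)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

err_before = total_error()
lr = 0.5  # learning rate
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        # backpropagation: right to left, nudging each weight so that this
        # input would trigger an output closer to the known correct result
        dy = (y - t) * y * (1 - y)
        for j in range(H):
            dh = dy * w2[j] * h[j] * (1 - h[j])
            w2[j] -= lr * dy * h[j]
            b1[j] -= lr * dh
            for i in range(2):
                w1[j][i] -= lr * dh * x[i]
        b2 -= lr * dy

err_after = total_error()
print(f"squared error before: {err_before:.3f}, after: {err_after:.3f}")
```

The mechanics are the point here: “learning” is just repeated small adjustments to numbers attached to arcs, driven by the mismatch between produced and expected output.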
As you can see, the whole model is not much more complex than HMMs, or at least not so much as to justify that all of a sudden we should accept that Neural Networks can think or learn. Granted, real models have more nodes (several thousand), but still, the differences from real neurons in real brains are fundamental: a real brain still has vastly more neurons, works in an analog way, combines electrical with chemical and even genetic effects, and we still don’t know how things like consciousness come about in it. In my view, “Neural Networks” aren’t any more (or less) likely to mimic brains than HMMs. But because of the “neural” in the name (unfortunately, alternative names, such as “perceptron,” with “perception” in it, aren’t much better), the model carries a big rucksack of metaphorical meaning like we saw above: isn’t it natural that an “electronic brain” using artificial “neurons” can “learn?”
Part II: The “deep” in “deep learning”
But wait – that doesn’t even take the “deep” in “deep learning” into account. In its literal meaning, “deep” describes spatial relations. Water can be deep, especially in lakes and seas. Other uses are nearly always metaphorical, as with the colors “deep red” and “deep blue,” which are simply intense and/or dark variants, nothing more. And of course combinations with “thinking” are quite popular: you have “deep thinkers” coming up with “deep thoughts” after a “deep dive” into the matter.
In his “The Hitchhiker’s Guide to the Galaxy,” Douglas Adams names a computer (it works for 7.5 million years to return “42” as the answer to the Ultimate Question of Life, the Universe, and Everything) “Deep Thought.” Later, a student and future IBM employee borrowed that same name for his real-life chess-playing computer. At IBM it was rebranded “Deep Blue” – here you see the whole metaphorical path from “deep sea” to “deep blue” (like the sea, but also IBM’s logo) into deep “thinking” in just two words – and under that name it gained fame in 1997 by beating world champion Garry Kasparov. In actuality, there isn’t much that is “deep” about a chess-playing computer: the algorithm is fairly “shallow” brute force; “Deep Blue” was successful because it used a lot of hardware for calculating possible moves, plus chips cleverly designed to help with evaluating positions. Nevertheless, the “deep” was here to stay. (For what it’s worth, Deep Fritz is a competing chess computer that is still commercially available today.) DeepQA was another foray into the world of “deep” things: text passage retrieval over mostly Wikipedia-originated text collections and ontologies, packaged up so that a machine could beat human challengers in Jeopardy! (under the name Watson).
The scene was set years ago when Machine Learning researchers decided they would expand the middle, “shallow” hidden layer in a neural net to multiple layers and make the nodes a bit more complicated, naming the result “Deep Neural Networks” (DNNs), or “Deep Belief Networks.” How would the unassuming, non-technical person hearing about “Deep Belief Networks” not apply the metaphorical associations they have been used to and connect this with “deep thinking” and Artificial Intelligence?
OK. I think we have dissected and demystified this sufficiently to see that what we have in front of us is a mathematical modeling technique that is not completely different from HMMs. And hence, despite the name, these systems can only “learn” in a metaphorical way. So, does that mean we should think less of them? Not at all!
First of all, DNNs help us, at Nuance, to drive accuracy up (and error rates down) for our core Automatic Speech Recognition (ASR) engine – the technology behind our cloud-based offerings as well as inside Dragon NaturallySpeaking, which is currently in its 13th generation. Over the last twenty years, error rates have decreased version after version. Achieving this within the HMM framework was becoming increasingly difficult, as that framework had been tuned and improved over decades and the headroom was getting smaller. So not only did DNNs drive error rates down at once; because there is such a huge space of largely untested possibilities under the umbrella of “DNNs” – different topologies, numbers of layers and nodes, how the nodes are structured, how they are trained, etc. – they also promise a lot of potential for the years to come. And in speech synthesis, DNNs improve the mapping from the linguistic features of the text to be synthesized to the acoustic parameters of the target speech, like prosody. In voice biometrics, they help improve the accuracy of speaker authentication. With all this in mind, it is no overstatement to say that DNNs were the single largest contributor to innovation across many of our products in recent years.
When I spoke of DNNs as not being complex (in the sense that it is hard to see how consciousness and true intelligence could hide in them), I did not mean that they were easy to find, or rather, easy to get to work. Quite the contrary. As mentioned before, Neural Nets were already around in the 1990s, but two problems limited their success back then. For one, when you want to train them on large data sets, and when the number of nodes and layers is non-trivial, the training takes very long – prohibitively long on the hardware of the day. Moreover, the training can end up in a model that is better than similar models “in the vicinity,” while, looking at the global search space, quite different and much better configurations exist. Whether your training ended up in such a “local optimum” or found something close to the global optimum depended on random factors during the early stages of the training process. The breakthrough of DNNs was made possible when both problems were solved by pioneers such as Geoffrey Hinton, Yoshua Bengio, and many others. Of course, better hardware helped with the first problem, but it was clever ideas about how to parallelize the work and how to use Graphics Processing Units, or GPUs (i.e. special chips originally developed for computer graphics), that took them farther. The problem of local optima was addressed by introducing pre-training: a processing step that pre-sets the model into a state from which training is more likely (and faster) to end up in a good optimum than when starting from scratch.
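The local-optimum problem is easy to see even without a neural net. In this toy, one-dimensional illustration (the function and all numbers are invented), gradient descent slides downhill from wherever it starts – and two different starting points end up in two different valleys, only one of which is the global optimum:

```python
def f(x):
    # a curve with two valleys; the left one is deeper (the "global optimum")
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    # derivative of f, i.e. the local slope
    return 4 * x * (x * x - 1) + 0.3

def descend(x, steps=1000, lr=0.01):
    """Plain gradient descent: repeatedly step downhill from x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # slides into the deeper, left valley
right = descend(+2.0)  # gets stuck in the shallower, right valley
print(f"from -2: x={left:.3f}, f={f(left):.3f}")
print(f"from +2: x={right:.3f}, f={f(right):.3f}")
```

The starting point – set by random initialization in a real network – decides which valley you end up in; pre-training amounts to choosing a promising starting point instead of a random one.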
The great thing is not only that these problems were solved and that Neural Networks now work in general; solving them has also opened up fields of additional research, which promise more improvements for the future. GPUs get ever more powerful, driven by the games industry, and DNNs get a free ride. Speeding up training times is not only important for practical applications; indirectly, it also helps progress on the algorithmic side: when DNN trainings took several weeks or months to complete on meaningfully sized data sets (as they did until a few years ago), experiments were very costly and progress was slow. Now that you can turn these trainings around in days or even hours, it is much easier to test new ideas.
Even with all of this progress, I, alongside other researchers, acknowledge that more work needs to be done. For example, using GPUs for all training steps of a DNN is a challenge because of the intertwined nature of the network. Because the output of a “neuron” potentially depends on many other neurons and the input data, the training is not a purely local matter (and hence not easily parallelizable): a lot of data needs to be transferred between compute nodes, potentially eating up the time advantage of the GPUs. How will we solve that? Also, when DNNs first took over the “backbone” of speech recognition, they replaced the speaker-independent model, which is trained on a large quantity of data reflecting nearly all the variety of dialects and individual speaking styles possible. The challenge here is that most practical systems use a second, speaker-dependent training step that adapts the base model to the specific speaker. Depending on whether you have only a few seconds of speech to train on or hours of samples to pull from, different methods have been used. As all of these were developed for HMM-like base models, they now need to be adapted to DNNs.
And so on, and so on.
Clearly, a lot of work awaits us still in the field of DNNs, but with that, a lot of excitement, too. Even if we don’t get carried away by the metaphors around “Deep Learning.”