Hearing is like seeing – for our brains and for machines

In a time when neural networks are increasingly popular for advancing voice technologies, language understanding and AI, it’s interesting to remember that many of the current approaches were originally developed for image or video processing. Convolutional Neural Networks (CNNs) are a case in point, and that may be no coincidence: the brain appears to use very similar processes for both visual and audio/speech stimuli.

As noted in previous posts, there is an array of neural network approaches that are more than just “deep.” In a time when neural networks are increasingly popular for advancing voice technologies, language understanding and AI, it’s interesting that many of the current approaches were originally developed for image or video processing. One of those methods, Convolutional Neural Networks (CNNs), creates exciting opportunities for advancing the state of the art in voice. Why image-processing neural nets carry over to voice is easy to see when you compare them with the way we as humans process sights and sounds in our brains.


What you need to know about CNNs

When you search for visual features, say edges or curves at a lower level or eyes and ears at a higher level (in the example of face recognition), you typically do so locally, as all relevant pixels are close to each other. In human visual perception this is reflected by the fact that a cluster of neurons is focused on a small receptive field, which is only a part of the much larger visual field. Because you don’t know where the relevant features will appear, you have to scan the entire visual field, either sequentially, sliding your small receptive field as a window over it from top to bottom and from left to right, or by having multiple smaller receptive fields (clusters of neurons) that each focus on small, overlapping parts of the input. The latter is what CNNs do. Together, these receptive fields cover the entire input and are called “convolutions.” Higher levels of the CNN then condense the information coming from the individual lower-level convolutions and abstract away from the specific location.
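To make the scan-then-condense idea concrete, here is a minimal sketch in plain Python. It assumes a toy 6x6 image and a hand-written 2x2 edge detector (`VERTICAL_EDGE`, a made-up name); in a real CNN the kernel weights are learned rather than written by hand, and there are many kernels and layers.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over every position of `image` where it fits.

    Like most CNN libraries, this computes a cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest response and forget where it occurred."""
    return max(max(row) for row in feature_map)


# Hypothetical detector for a vertical dark-to-bright edge.
VERTICAL_EDGE = [[-1, 1],
                 [-1, 1]]


def make_image(edge_col, size=6):
    """Toy image: 0 (dark) left of `edge_col`, 1 (bright) from there on."""
    return [[1 if c >= edge_col else 0 for c in range(size)]
            for _ in range(size)]


# The same edge placed at two different horizontal positions.
left = convolve2d(make_image(2), VERTICAL_EDGE)
right = convolve2d(make_image(4), VERTICAL_EDGE)
```

The response maps `left` and `right` peak in different columns, so the lower level still knows where the edge is. After `global_max_pool` both images give the same score, which is the "abstracting away from the specific location" step in its simplest form.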



Because CNNs originated in image recognition, my colleagues who work in handwriting recognition (a visual task) find CNNs very useful for their work, achieving more than 60% error reduction versus previous methods.

But we have also found several applications of CNNs to speech and language.

For example, my colleague Raymond Brueckner just contributed to a paper published at ICASSP 2016 last month, which shows how CNNs can be applied to a raw speech signal in an end-to-end way (i.e. without manual definition of features). The CNN’s “convolutions” tile an input field that has time as one dimension and the energy distribution over the various frequencies as the other, so the network learns automatically which frequency bands are most relevant for speech. The higher layers of the network were then used to detect emotions in the speech signal.
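As a rough illustration of how a convolution over a raw waveform can become sensitive to a frequency band, consider the sketch below. It is not the method from the paper: the kernels here are hand-set sine waves standing in for filters a network would learn, and the sample rate, tone and filter frequencies are all assumed toy values.

```python
import math


def conv1d(signal, kernel):
    """Slide `kernel` along `signal` in time, one sample per step."""
    n = len(kernel)
    return [sum(signal[t + k] * kernel[k] for k in range(n))
            for t in range(len(signal) - n + 1)]


def sine_kernel(freq_hz, sample_rate, length):
    """Stand-in for a learned filter: a short sine wave at `freq_hz`."""
    return [math.sin(2 * math.pi * freq_hz * k / sample_rate)
            for k in range(length)]


def mean_energy(values):
    """Average squared response of a filter over all time positions."""
    return sum(v * v for v in values) / len(values)


SAMPLE_RATE = 8000  # samples per second (assumed)

# Toy "speech" signal: 0.1 s of a pure 440 Hz tone.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)]

# Three 25 ms filters tuned to different frequencies.
filters = {f: sine_kernel(f, SAMPLE_RATE, 200) for f in (220, 440, 880)}

# How strongly each filter responds to the signal, averaged over time.
energies = {f: mean_energy(conv1d(signal, k)) for f, k in filters.items()}
best = max(energies, key=energies.get)
```

The filter tuned to 440 Hz responds far more strongly than the other two, so `best` is 440. During training, a network adjusts its kernels so that the bands useful for the task end up producing such strong responses.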

The next example is “intent classification” in Natural Language Understanding (NLU), or understanding from a user request what type of task the user wants to achieve (we covered how the other aspect of NLU, Named Entity Recognition, works in this post). For example, in the command “Transfer money from my checking account to John Smith,” the intent would be “money_transfer.” The intent is typically signaled by a word or a group of words (usually close to each other), which can appear anywhere in the query. So, in analogy to image recognition, we need to search for a local feature by sliding a window over a temporal phenomenon (the utterance, looking at one word and its context at a time) rather than a spatial field. And this works very well: when we introduced CNNs for this task, they were more than 10% more accurate than the previous technology.
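The sliding-window search over an utterance can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the word patterns in `FILTERS` are hand-written stand-ins for what a trained CNN would learn as weights over word embeddings, and the `check_balance` intent is invented for the example.

```python
def windows(tokens, size):
    """All contiguous word windows of the given size, left to right."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Hypothetical "filters": each intent fires on a few local word patterns.
FILTERS = {
    "money_transfer": {("transfer", "money"), ("send", "money")},
    "check_balance": {("my", "balance"), ("account", "balance")},
}


def classify_intent(utterance, size=2):
    """Return the intent whose pattern occurs anywhere in the utterance."""
    tokens = utterance.lower().replace(",", "").split()
    spans = windows(tokens, size)
    # Max over all window positions: 1 if the pattern occurs anywhere,
    # 0 otherwise. The position of the match is discarded.
    scores = {
        intent: max((1 if span in patterns else 0 for span in spans),
                    default=0)
        for intent, patterns in FILTERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these toy patterns, `classify_intent("Transfer money from my checking account to John Smith")` and `classify_intent("Please could you transfer money to John")` both return `"money_transfer"`, even though the signaling words sit at different positions in the query.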


Neighbors in the brain – and in the field

Why are CNNs successful at these tasks? A rather straightforward explanation could be that these tasks simply share characteristics with image processing; they are all of the ‘find something small in something bigger, and we don’t know where it might be’ type. But there may be another, slightly more interesting explanation: the fact that CNNs designed for visual tasks also work for speech-related tasks may reflect that the brain uses very similar processes for both visual and audio/speech stimuli.

Consider phenomena like synesthesia, the “stimulation of one sensory or cognitive pathway lead[ing] to automatic, involuntary experiences in a second sensory or cognitive pathway.” For example, audio or speech stimuli can lead to a visual reaction. (I have a mild version of this: for me each day of the week, or rather the word describing the day, has a distinct color. Monday is dark red, Tuesday grey, Wednesday a darker grey, Thursday a lighter red, and so on.) This is interpreted as an indication that the processing of audio and speech signals and the processing of visual signals are somehow “neighbors” in the brain. Similarly, it has been shown that brain areas normally devoted to processing audio signals and speech can be used for visual tasks: people born with hearing impairments, for example, can re-purpose the audio/speech areas of their brains to process sign language. This suggests that the organization of the brain cells (neurons) processing visual and audio signals is very similar.

There is also a very practical ramification of the similarity of visual and audio/speech and language processing. We have found that Graphics Processing Units (GPUs), which were developed for computer graphics (the visual channel), can be employed to speed up machine learning tasks for speech and language, too. The reason is that the tasks that need to be handled are again similar in nature: applying relatively simple mathematical operations to lots of data points in parallel. So you could say that new developments in computer gaming helped to make the training of Deep Neural Nets feasible.

Clearly, there is no way to be an isolationist when working in this field. Just as my colleagues who work in handwriting recognition study Convolutional Neural Networks, so do those working to advance Natural Language Understanding. We are specialists just as much as we are generalists. Applying one thing to another to improve a process or technology is the nature of our work. But that shouldn’t be a surprise, right? After all, judging by how it processes visual and audio stimuli, the human brain works in a very similar way. So it’s not far-fetched to believe that CNNs, originally designed for vision, will ultimately help machines to listen and better understand us, something that’s crucial as we are continually propelled forward into this new era of human-machine interaction.


Nils Lenke

About Nils Lenke

Nils joined Nuance in 2003, after holding various roles for Philips Speech Processing for nearly a decade. Nils oversees the coordination of various research initiatives and activities across many of Nuance’s business units. He also organizes Nuance’s internal research conferences and coordinates Nuance’s ties to Academia and other research partners, most notably IBM. Nils attended the Universities of Bonn, Koblenz, Duisburg and Hagen, where he earned an M.A. in Communication Research, a Diploma in Computer Science, a Ph.D. in Computational Linguistics, and an M.Sc. in Environmental Sciences. Nils can speak six languages, including his mother tongue German, and a little Russian and Mandarin. In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.