Deep learning, coming to a car near you

If you have been hearing about the latest breakthroughs in artificial intelligence (AI), you have likely also heard a lot about deep learning or deep neural networks. Some believe it is a game changer; some say it's hype. But one thing is certain: these techniques have brought us increased accuracy in specific domains such as speech recognition. Once constrained to high-powered server farms, neural network capabilities have made their way into embedded speech recognition engines running on devices, and now into automotive systems. The result? An in-car system that better understands your commands and delivers a more conversational experience.

The idea of creating computing systems inspired by the human brain is more alive than ever, as illustrated by the Human Brain Project in the EU Horizon 2020 program [6]. Artificial neural networks (ANNs) are an example of such systems and have been explored for many decades. They are networks of many thousands of simple but massively connected nodes, where each node generates an output from a weighted sum of its inputs (Figure 1). Whereas the brain learns and stores information by growing connections between neurons, ANNs store information in these weights. Algorithms have been developed that adapt the weights based on large numbers of examples presented to the network, so that it can perform classification tasks with high precision. This process is referred to as "learning". Just as we all have learned by observation and repetition, massive amounts of (good) data are the fuel these learning algorithms thrive on. In fact, copious amounts of data are one of the main ingredients of ANN success: the more material there is to learn from, the better these networks can be trained.
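To make the node computation concrete, here is a minimal sketch in Python of a single artificial node. The sigmoid activation and the specific weight values are illustrative assumptions, not details from the text:

```python
import numpy as np

def node_output(inputs, weights, bias):
    """One artificial node: a weighted sum of its inputs,
    squashed by a nonlinearity (here a sigmoid)."""
    z = np.dot(weights, inputs) + bias   # weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

# Illustrative values: the network's "knowledge" lives in w and b.
x = np.array([0.5, -1.2, 0.3])           # inputs from other nodes
w = np.array([0.8, 0.1, -0.4])           # weights adapted during learning
print(node_output(x, w, bias=0.1))       # a value between 0 and 1
```

Learning, in this picture, is nothing more than nudging w and b across many examples until the outputs match the desired classifications.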


Applied to speech

In the early nineties, ANN technology was applied to the speech recognition problem, but it initially had a hard time competing with the more classical generative multi-Gaussian models [2]. This started to change as of 2006, not coincidentally around the time computing chips went multi-core (Figure 2). The emergence of powerful computing environments and the accompanying development of new learning algorithms drove the world of artificial intelligence to further explore techniques referred to as deep learning, deep neural networks (DNNs) and deep belief networks. These became particularly relevant in computer vision, speech recognition and natural language processing, enabling powerful computers to model and learn complex functions from massively available data. Used as classification networks, they turned out to be very well suited for discriminating phonemes, the basic abstract representation of spoken language. Hybrid DNN-HMM (Hidden Markov Model) systems emerged, applying the discrimination and generalization power of neural networks to specific problems such as acoustic modeling or speech detection, within an ecosystem of proven technology. This enabled us to deploy them in a seamless, non-disruptive manner, without, for example, the need to overhaul how developers build applications with their traditional speech recognition engine.
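As a rough illustration of the hybrid idea, the sketch below shows how a small feedforward network could turn one frame of acoustic features into a probability distribution over phoneme classes, which is the quantity a DNN-HMM decoder consumes. The layer sizes, ReLU activations and random weights here are purely hypothetical:

```python
import numpy as np

def softmax(z):
    """Turn raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_posteriors(frame, layers):
    """Forward pass of a small feedforward net: hidden layers apply a
    weighted sum plus ReLU; the output layer is a softmax over phoneme
    classes, the posteriors a hybrid DNN-HMM decoder would consume."""
    h = frame
    for W, b in layers[:-1]:
        h = np.maximum(0.0, W @ h + b)   # hidden layer with ReLU
    W_out, b_out = layers[-1]
    return softmax(W_out @ h + b_out)    # phoneme probabilities

# Hypothetical sizes: 39-dim acoustic features, 256 hidden units, 40 phonemes.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((256, 39)) * 0.1, np.zeros(256)),
          (rng.standard_normal((40, 256)) * 0.1, np.zeros(40))]
posteriors = dnn_posteriors(rng.standard_normal(39), layers)
print(posteriors.shape, posteriors.sum())  # (40,) probabilities summing to 1.0
```

In a real hybrid system, these per-frame posteriors replace the Gaussian likelihoods inside an otherwise conventional HMM decoder, which is what makes the integration non-disruptive.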

The structure of many simple computing nodes is ideally suited to the massively multi-core architectures found in graphics processing units (GPUs), especially in the learning phase. This massive power is required to handle the large amounts of data these systems need in order to learn and generalize, rather than merely memorize the data itself (a failure mode called overfitting). Exactly how they achieve this is the subject matter of many researchers in the field, and a deeper understanding will undoubtedly lead to more powerful learning algorithms and more powerful systems.

Figure 2: CPU vs. GPU performance over time

The training of DNNs is by far the most demanding part in terms of processing power. I remember the special hardware in our labs for training small neural networks in the nineties [3], on amounts of data that were minimal by today's standards. It was a mini-computer stuffed with bespoke boards carrying C30 DSP processors, connected in a ring architecture and designed to compute complex matrix manipulations at a speed not feasible on any standard hardware. Moore's law, providing increased computational power over the following decade, made such hardware obsolete, but still could not satisfy DNNs' hunger for more. The availability of powerful GPUs designed for exactly the type of computations needed to train such networks reduced learning times from weeks to days. The right-hand side of Figure 2 shows how GPUs offer an order of magnitude more capacity for the types of operations typically used during training. Once trained, using the networks for decoding is significantly less demanding and can run on general-purpose CPUs, though GPUs are often called upon as well for larger networks. The technology first emerged in the cloud, powering, for example, the Nuance speech recognizers people use to operate their smartphones, TVs and cars.
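A back-of-envelope calculation illustrates the gap between training and decoding cost. The network and corpus sizes below are hypothetical, chosen only to show the orders of magnitude involved:

```python
# Hypothetical hybrid-era network: input features, hidden layers, outputs.
layer_sizes = [440, 2048, 2048, 2048, 2048, 9000]

# One forward pass is essentially a chain of matrix-vector products:
# roughly 2 * rows * cols floating-point operations per layer.
forward_flops = sum(2 * a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Training needs forward AND backward passes (roughly 3x the forward cost)
# over a large corpus, repeated for several epochs; decoding needs just
# one forward pass per 10 ms frame of audio.
train_flops_per_frame = 3 * forward_flops
frames = 1000 * 3600 * 100   # ~1000 hours of speech at 100 frames/second
epochs = 10

print(f"decoding: {forward_flops / 1e6:.0f} MFLOPs per frame")
print(f"training: {epochs * frames * train_flops_per_frame / 1e15:.0f} PFLOPs in total")
```

Tens of megaflops per decoded frame is comfortable for a modern CPU, while hundreds of petaflops of training work is exactly the kind of bulk matrix arithmetic GPUs were built for.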


DNNs, coming to a car near you

At Nuance, these techniques have been explored for years and today form an important part of our research activities. Recently, they have found a place not only on the server farms powering our cloud operations, where compute power is abundant, but also on our embedded and hybrid speech recognition platforms. Just as we started to see classical GMM-HMM systems reach their performance limits, deep learning opened up a new path towards increasing accuracy. DNNs brought us big gains when we introduced them and created a path towards further improvements over the next couple of years. Running such networks on embedded platforms was just a matter of time, waiting for sufficiently powerful hardware to arrive (such as decently powered GPU co-processors for that market segment).

We recently released a new version of our embedded automotive speech recognition engine that features enhanced accuracy and performance as a result of applying DNNs to our technology. The next generation of automotive deployments coming to market in the coming months and years will benefit from the advantages of deep neural network algorithms. Standard ARM-based platforms are found abundantly in in-car dashboard systems, and, optimized with patent-pending techniques, these platforms already benefit from the enhanced modeling power of DNN technology.
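Nuance's actual optimizations are patent pending and not described here, but 8-bit weight quantization is one widely used generic technique for fitting such networks onto ARM-class hardware. The sketch below illustrates that general idea only, not the product's implementation:

```python
import numpy as np

def quantize_int8(W):
    """Map float32 weights to int8 plus one scale factor per matrix: a
    common way to shrink a network's memory footprint and exploit fast
    integer SIMD on embedded CPUs (illustrative only)."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def dequantize(Wq, scale):
    """Recover approximate float32 weights from the int8 representation."""
    return Wq.astype(np.float32) * scale

# Hypothetical layer: 2048 x 440 weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 440)).astype(np.float32)
Wq, s = quantize_int8(W)
print(f"memory: {W.nbytes // 1024} KB -> {Wq.nbytes // 1024} KB")   # 4x smaller
print(f"max reconstruction error: {np.abs(W - dequantize(Wq, s)).max():.4f}")
```

A fourfold memory reduction with a small, bounded approximation error is often enough to make a network that was trained on GPUs decode comfortably on a dashboard processor.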

Error rate reductions of 40 to 50 percent compared to previous generations (bringing, for example, a word error rate of 10 percent down to 5 or 6 percent) are expected to open up new ways in which people will interact with their cars using speech: more natural, and more accurately serving a wider range of drivers. The result is a more intuitive, seamless interaction with the car that inherently minimizes distraction, since tasks that would otherwise require manual-visual interaction can successfully be conducted through speech. The demand for ever-increasing accuracy, the ASR variant of Moore's law (Figure 3), continues, as does the demand for robustness to adverse conditions such as moving cars or the use of devices in public places.

Figure 3: Exponential error rate reduction

As devices and platforms see their computing power grow, we will see increased usage of these techniques for solving specific problems in speech recognition and natural language understanding. Acoustic modeling already demonstrates the benefits. Language modeling and intent classification for natural language understanding, and even integration with camera vision, are candidates to come next. Complementary techniques will still be needed: having a hammer does not imply every problem is a nail. But DNNs certainly give us a great power tool in our toolbox, one that gives us the ability to build a smarter, safer in-car experience for drivers all over the world.


References

[1]    Richard Socher and Christopher Manning, Deep Learning for NLP, Stanford University, http://nlp.stanford.edu/courses/NAACL2013/NAACL2013-Socher-Manning-DeepLearning.pdf

[2]    Morgan, Bourlard, Renals, Cohen, Franco, “Hybrid neural network/hidden Markov model systems for continuous speech recognition.” ICASSP/IJPRAI, 1993

[3]    Morgan et al., RAP: A Ring Array Processor for Multilayer Perceptron Applications, ICASSP, 1990

[4]    Geoff Hinton, Keynote talk: Recent Developments in Deep Neural Networks. ICASSP, 2013

[5]    R. Raina, A. Madhavan, A. Ng., Large-scale Deep Unsupervised Learning using Graphics Processors, Proc. 26th Int. Conf. on Machine Learning, 2009

[6]    The Human Brain Project, https://www.humanbrainproject.eu

[7]    Jeremy Howard, TED talk on the applications of deep learning and its future consequences



About Bart D'hoore

Bart D'hoore is Director of Embedded Automotive ASR at Nuance, where he leads the research team working on embedded speech recognition and natural language processing. The main product focus is VoCon, the automotive industry's reference embedded ASR engine. Bart holds a Master's degree in engineering from the University of Ghent. He has over 20 years of experience in various research and research-management positions focused on acoustic modeling, languages and compute grids, serving various markets. He started his career at Lernout & Hauspie Speech Products, has participated in several international projects (sponsored by the EU, IWT and NTU), including SpeechDat, STEVIN, Autonomata and Get Home Safe, and has several publications and patents.