Tuning out the noise: How voice recognition works in today’s connected cars

In this blog series, we’ll be looking at how audio can affect the user experience in a voice-enabled car. This first post will cover some of the challenges faced in the automotive environment as well as some technologies used to overcome these challenges. Subsequent posts will cover topics like voice barge-in and in-car communication.

Input audio in a voice-enabled automobile is a behind-the-scenes phenomenon. The end user doesn’t usually notice the audio chain unless things go awry. It is similar to working as a stagehand on Broadway: a difficult and even thankless job, full of unexpected obstacles, that isn’t noticed until the curtain drops in the middle of Kristin Chenoweth’s big solo.

Audio has a long, sometimes difficult route in the connected car — traveling from your mouth all the way to the speech recognizer “hearing” what you said. In the short version of this journey, there are two halves:

Part 1: Inside the car cabin

The first half of the journey takes you through the interior of the vehicle – from your mouth to the car’s microphone. Unfortunately, cars can be very noisy environments. If you could hear the bumping, grunting, and shuffling of the stagehands, would you really enjoy the play? Think of everything that you can hear in the car: engine revs, potholes, tractor trailers passing on the right, kids playing in the backseat, windshield wipers, climate control noise… and then finally your voice.

Take potholes, a common condition on the Michigan highways that I frequent. You engage the VR system and say “Call Al”. You merge onto an off-ramp at just the right time — and the VR system might actually hear “Call <BUMP><BUMP><BUMP>” instead. Competing voices are another common pitfall in the voice-enabled car. While driving with your kids, you attempt to change the radio station by voice. The VR system now has to interpret what is meant by “DADD…” – “Tune to DAD 100.3 FM” – “…DYYY”. Noise and interference like this can cause significant misrecognitions and other unwanted behavior from the VR system.

Part 2: Inside the voice recognition (VR) system

The second half of audio’s journey can be equally difficult. Having a correctly structured audio configuration within an infotainment system is critical to a successful user experience. During a voice recognition dialog, the system must know when to start and stop listening for the user (the “listening window”). Like stagehands opening and closing the curtain during scene changes, this has a significant impact on the user experience. If the curtain opens early, the audience sees what they shouldn’t. If it closes too quickly, the audience will miss key elements of the plot. In the case of the car’s VR, this is equivalent to the system hearing “<BEEP> Dial 911” or “Dial 1-800-5//cutoff” respectively. Both situations could cause the user to get an unexpected result.
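To make the listening window concrete, here is a minimal sketch of one common approach: open the window when frame energy crosses a threshold, and close it only after several consecutive quiet frames (a “hangover”) so brief pauses don’t cut the user off. The function names and threshold values are illustrative assumptions, not how any particular infotainment system implements it.

```python
# Illustrative energy-based listening window (endpointing) sketch.
# Thresholds and hangover length are arbitrary example values.

def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def listening_window(frames, open_thresh=0.01, close_thresh=0.005,
                     hangover=3):
    """Return (start, end) frame indices of the detected utterance,
    or None if speech was never detected."""
    start = None
    quiet = 0
    for i, frame in enumerate(frames):
        e = frame_energy(frame)
        if start is None:
            if e >= open_thresh:          # open the curtain
                start = i
        else:
            if e < close_thresh:
                quiet += 1
                if quiet >= hangover:     # close after sustained silence
                    return (start, i - hangover + 1)
            else:
                quiet = 0                 # speech resumed; keep listening
    return (start, len(frames)) if start is not None else None
```

Closing too eagerly (a short hangover) produces exactly the “Dial 1-800-5//cutoff” failure described above, while opening late clips the start of the command.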

Other areas of audio configuration also present potential difficulties for the end user experience. A common reaction to a non-working VR system is to speak more loudly with each failure (as humans, we sometimes do this in conversation to make sure we are clearly heard). But what if the audio level in the voice recognizer is already configured at too high a volume? Yelling will only make the problem worse, frustrating the user with multiple failed recognitions. This is why proper configuration and tuning are so important – a scene you can see play out in a demo of Dragon Drive.
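A simple way to see why yelling backfires: once the input gain is too hot, samples pile up at digital full scale and the waveform is clipped, which distorts exactly the speech the recognizer needs. The sketch below is a hypothetical level check, with made-up names and thresholds, of the kind a tuning engineer might use to flag this condition.

```python
# Hypothetical capture-level check: if too many samples sit at
# digital full scale, the input gain is likely too high and
# speech reaching the recognizer will be clipped.

def clipping_ratio(samples, full_scale=1.0, eps=1e-6):
    """Fraction of samples at (or beyond) digital full scale."""
    clipped = sum(1 for s in samples if abs(s) >= full_scale - eps)
    return clipped / len(samples)

def gain_too_hot(samples, max_ratio=0.01):
    """True if the capture level looks too hot for reliable recognition."""
    return clipping_ratio(samples) > max_ratio
```

In a system flagged this way, the fix is to lower the configured input gain, not for the user to speak louder.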

The voice-enabled car of the future will selectively ignore driving and passenger noises, allowing a seamless and error-free experience for the operator. Luckily for us, the future is approaching quickly. Today, there are exciting new technologies aimed at addressing some of these common audio challenges. New developments in digital signal processing allow both stationary noises (like road and fan noise) and non-stationary noises (like road bumps) to be well suppressed. Other new technologies allow the system to ignore interfering speakers (one variant is called “off-axis suppression”). With this enabled, passengers can hold side conversations while you speak voice recognition commands without worry.
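One classic technique for the stationary part of the problem is spectral subtraction: estimate the noise magnitude in each frequency bin during silence, then subtract that estimate from every frame, flooring the result so magnitudes never go negative. The sketch below operates on already-computed magnitude spectra (represented as plain lists) and uses assumed names and a made-up floor value; production systems use considerably more sophisticated estimators.

```python
# Sketch of spectral subtraction for stationary noise (road, fan):
# learn the average noise magnitude per frequency bin from
# noise-only frames, then subtract it bin-by-bin from each
# speech frame, keeping a small spectral floor.

def estimate_noise(silent_spectra):
    """Average magnitude per bin over frames known to be noise-only."""
    n = len(silent_spectra)
    bins = len(silent_spectra[0])
    return [sum(f[b] for f in silent_spectra) / n for b in range(bins)]

def spectral_subtract(spectrum, noise, floor=0.01):
    """Subtract the noise estimate per bin, flooring at floor * magnitude."""
    return [max(m - n, floor * m) for m, n in zip(spectrum, noise)]
```

This works precisely because road and fan noise change slowly: the estimate learned during silence is still valid while you speak. It cannot handle a pothole bump or a talking passenger, which is why non-stationary suppression and off-axis suppression are separate techniques.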

So, the future looks bright for voice-enabled cars, but until these exciting technologies are in place, what can be done to enhance experiences with speech in the car? Here are some suggestions: if you are using a system that prompts with a beep – wait for it! Don’t speak over it. Talk in your normal voice, at a comfortable volume (maybe a bit louder if you are in a noisy environment). Remember, like the silent stagehands diligently laboring behind-the-scenes, the audio system in your voice-enabled car is working tirelessly to offer the optimal user experience.

Stay tuned for the next article in this series, which will delve into an audio technology called voice barge-in. With voice barge-in enabled, the user can speak voice commands during prompt playback and even speak over the beep, allowing for a more conversational user experience.


About Connor Smith

Connor Smith is a senior audio engineer for Nuance’s automotive business. He started at the Nuance Burlington office in 2011 after getting his master’s in Sound Recording Technology from the University of Massachusetts, Lowell. In 2012, he moved to Michigan to provide onsite support for Nuance’s automotive customers. Much of his work involves Nuance SSE products, including tuning and testing of hands-free systems. He also supports ASR tuning and validation testing for many customers. Connor lives with his wife Becky and dog Peanut, who all enjoy being outside – hiking, golfing, or playing at the dog park.