Back in the very earliest days of speech recognition, machines learned to recognise a small – very small – number of words and act on them. Take the famous 1962 IBM speech recognition system, Shoebox: it could recognise just 16 words and perform actions like adding up numbers. (Check out The History of Speech Recognition Part 1 and Part 2.)
It’s a long way from there to where we are today, with speech recognition systems able to understand tens of thousands of words, put words in the right context using a grasp of grammar, and instruct objects to do what we ask, whether that’s dimming the lights or composing an email.
However, speech recognition still has more to achieve. Truly conversational speech recognition isn’t here yet, but a whole array of service providers want to use it. Online retailers want it to help them deal with customer interactions. Banks want the same thing. Any service that lets you make a booking over the phone, from your sports centre to your GP, wants a computerised agent to handle it rather than a person.
And then there’s the personal digital assistant. While smart speakers are opening the door to this opportunity for many of us today, the truly versatile personal digital assistant isn’t here yet. That’s the personal digital assistant that we can ask to remind us to call Mum on Thursday evening, set up series record for a new TV show, book a B&B in a particular price range, in a particular location, for a specific weekend, and email us with the details of what’s on at each of the cinemas within a 5-mile radius of that location on the Saturday evening (with a booking link and, knowing our personal preferences, with the movie list in our preferred order). That personal digital assistant, which we can trust to get everything right 100% of the time, is not here yet.
To get to that personal digital assistant, developers like Nuance have to build on our deep learning techniques. We need not just to match words and understand their context in a general sense; we need to understand what an individual means when they say certain things. We need to get really good at natural language processing – understanding the way individuals express thoughts.
That’s harder than you might think. Two people might use completely different words to describe the same thing, yet both just know they’re in agreement. For a computer to do that is a giant step. A computer needs to be able to be ‘abstract’ in the way it ‘thinks’ about what it ‘hears’, rather than simply matching sounds to patterns it has in its system. It’s the difference between someone telling you what a painting looks like and actually seeing that painting with your own eyes.
Will we get there? Yes, I think we will. Already, deep learning helps Nuance personalise the way Dragon works to better match the way individuals speak, making it faster and more accurate. Already there are Artificial Intelligence (AI) based systems that can extract the important words from a sentence so they can text you a bank account summary or your last three transactions, and there’s plenty of research going into improving this area. We’ve got more computing power at our disposal than ever before, too, and that’s crucial.
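To make that “extract the important words” idea concrete, here’s a minimal sketch of the task. This is not Nuance’s system – real AI-based assistants use trained models – just a hypothetical keyword-spotting routine; the function name `extract_request`, the intent labels, and the cue lists are all invented for illustration:

```python
import re

# Toy keyword/slot spotting: a crude stand-in for the kind of
# "pick out the important words" step a banking assistant needs.
# Real systems learn this from data; here we just pattern-match.
INTENTS = {
    "balance": ["balance", "account summary"],
    "transactions": ["transaction", "transactions"],
}

WORDS_TO_DIGITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def extract_request(utterance: str) -> dict:
    """Guess the user's intent and any count mentioned ('last three', 'last 3')."""
    text = utterance.lower()

    # First cue list whose keywords appear in the utterance wins.
    intent = next(
        (name for name, cues in INTENTS.items() if any(c in text for c in cues)),
        "unknown",
    )

    # Spot a number, written either as a digit or as a small spoken word.
    count = None
    match = re.search(r"\b(\d+)\b", text)
    if match:
        count = int(match.group(1))
    else:
        for word, value in WORDS_TO_DIGITS.items():
            if re.search(rf"\b{word}\b", text):
                count = value
                break

    return {"intent": intent, "count": count}


print(extract_request("Text me my last three transactions"))
# {'intent': 'transactions', 'count': 3}
```

A system like this breaks the moment a user phrases the request differently (“what went out of my account this week?”), which is exactly why the deeper, abstraction-capable understanding described above matters.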
One day I fully expect to use everyday speech to ask my personal digital assistant to do all those things I listed above. It’s not a matter of if that will happen, just a matter of when!