“Over” is short for “over to you,” indicating that it’s your turn to talk on a shortwave radio or walkie-talkie (or any half-duplex comm tech, for you nerds out there). Smart speakers are super cool and a step forward in voice, but they’re still half-duplex, clunky, unnatural voice interfaces. We’ll all look back one day and remember how quaint today’s smart speakers were, the way we remember Morse code, tape players, and VCRs. Try enforcing strict turn-taking in a face-to-face conversation or conference call sometime, and you’ll get a feel for what smart speakers, and all voice interfaces for that matter, are missing out on. There’s a whole field of study around the protocols and rules of human conversation, called “pragmatics,” that examines how humans interact one-to-one, one-to-many, and many-to-many.
For example, I’ll say, “Alexa, play ‘Fool in the Rain’ by Led Zeppelin on Spotify,” and wait the requisite three seconds of silence so Alexa knows I’m done talking (it might be easier to just say “over”). Then Alexa says, “I’m sorry, I can’t find ‘Fool in the Rain’ by Led Zeppelin on Spotify.” I’ll remember I cancelled Spotify and try to correct myself by speaking over Alexa: “No, play it on Amazon Music.” It’s natural to do this, and a person having the same conversation wouldn’t miss a beat. Alexa, being half duplex, can’t listen while she’s talking, so the correction is lost.
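The “wait for silence” step can be sketched in a few lines of Python. This is a toy illustration only, not how Alexa or any real smart speaker detects the end of a turn: the function name `find_end_of_turn`, the per-frame energy values, the threshold, and the number of silent frames are all made up for the example.

```python
from typing import Optional, Sequence


def find_end_of_turn(
    frame_energies: Sequence[float],
    silence_threshold: float = 0.1,   # hypothetical energy cutoff for "silence"
    required_silent_frames: int = 3,  # hypothetical stand-in for the silence wait
) -> Optional[int]:
    """Return the index of the frame at which the speaker's turn is
    declared over, or None if the speaker never pauses long enough."""
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: the turn is still going, so reset the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence after speech: count it toward the end-of-turn wait.
            silent_run += 1
            if silent_run >= required_silent_frames:
                return i
    return None
```

A half-duplex device in this sketch only starts responding once `find_end_of_turn` returns an index, and it stops listening while it responds. Anything said during the response, like “No, play it on Amazon Music,” never reaches a function like this at all.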
In addition to the half-duplex limitation, Alexa also can’t understand multiple speakers. Even the best user interfaces today employ turn-taking to manage conversations, and they don’t work at all with more than two speakers in a conversation. For example, if my children interrupt Alexa while she’s playing “Fool in the Rain” and ask her to play “Space Unicorn,” a song that can make you insane after hearing it for the 400th time, I typically respond by shouting, “Laa Laa Laa Laa!!” to confuse Alexa and keep her playing good music.
Managing the turn-taking in a conversation with multiple speakers is no simple task. It requires that you listen while you talk and also respond to visual cues (in a face-to-face conversation). For example, Japanese speakers often produce back-channel expressions such as un or sō while their partner is speaking. They also tend to mark the end of their own utterances with sentence-final particles and produce vertical head movements near the end of their partner’s utterances. See the Wikipedia article on turn-taking for a long description of the complexity. The listen-and-talk problem gets much harder with every speaker you add. A bot will need to know whether it’s having a friendly conversation and should wait until the person is done talking, or whether it’s in an argument and should cut into the rant.
Recognizing these shortcomings is the first step in overcoming them. Nuance R&D is working on these problems and others to transform the way people interface with technology.