Can we build ‘Her’?: What Samantha tells us about the future of AI

The movie Her has captured the public imagination with its vision of a lightning-fast evolutionary trajectory of virtual assistants, and the emotional bonds we could form with them. Is this a likely future? Nuance CTO Vlad Sejnoha weighs in on what to expect.

What will the next generation of intelligent computing look like?

The film’s narrative arc shows the evolution of the Samantha operating system and her relationship with her user, Theodore, transforming from a competent assistant, to a literary agent that proactively arranges the publication of Theodore’s letters, to an ideal girlfriend, and ultimately to an entity that loses interest in humans because they have become unsatisfying companions. Throughout, Samantha is an impressive conversationalist with a perfect command of language, a grasp of the broader context, a grounding in common sense, and a mastery of the emotional realm.

This is a dizzying progression, but even Samantha’s first, strictly utilitarian incarnation is impressive. Her speech recognition, natural language understanding, speech generation, dialog, reasoning, planning, and learning all far exceed the current state of the art. She is able to take on complex tasks — she filters Theodore’s inbox with a sophisticated understanding of the goal — and is able to engage in flexible reasoning without any obviously predetermined responses.

In contrast, today’s virtual assistants engage in simple dialogs and produce scripted “chat.” They are capable of limited predictive and proactive behavior, but learn relatively slowly, and mostly automate “one-shot” commands — placing or directing calls, making appointments, finding directions, sending messages, and performing searches.

What will it take for today’s virtual assistants to become more like even the ‘out-of-the-box’ Samantha, the hyper-efficient aide?


Communication and context

Let’s start with an assistant’s need to understand higher-level goals and to become better at filling in the blanks with implicit information, without requiring explicit step-by-step instruction.

Such an assistant would act on our instructions to “Book a table at Zingari after my last meeting tomorrow, and ask Tom and Brian to meet me there,” without us having to break this “meta-assignment” into its constituent sub-tasks. In executing the sub-tasks, the assistant would attempt to overcome a variety of possible obstacles and present us with reasonable choices, rather than simply reporting a failure.

The fact that the request is expressed in natural language is no accident: the ability to understand natural language is central to an efficient assistant. It is difficult to envision how one might effectively control an intelligent system without language. Specifying the restaurant reservation request, or instructing an assistant to inform us of proximity to a café, would be difficult in a general point-and-click interface. Relying on language to interact is not merely a convenience, or an esthetic choice intended to make these systems entertainingly ‘human-like.’ Rather, language plays the role of a powerful control mechanism that allows us to convey information, requests, and constraints in an efficient manner.

Of course, most communication using language is inherently underspecified. For example, the restaurant request above is ambiguous regarding the time of the requested reservation: should the act of making the reservation take place after the meeting, or should it be performed now? Human listeners apply specific knowledge, a model of the world, as well as context to resolve the ambiguity — just like Samantha does. A successful virtual assistant needs to do the same.

Used well, language is a compact, effective communication system, but its power relies on the assumption that speakers and listeners are intelligent and knowledgeable about both the social and the physical world. When automated systems inject themselves into the conversation, they are obligated to behave in a humanlike way. When a system does not meet this obligation, its interlocutors are likely to become frustrated, leading to a stilted and unnatural conversation. Such communication is likely to end in failure.

An increased reliance on natural language understanding presupposes ever more accurate voice recognition — who after all wants to type natural language sentences to their assistant? Samantha’s powers of perception in this respect are impressive: she is able to recognize Theodore’s speech flawlessly. While real world voice recognition is not yet up to Samantha’s standards, her performance seems within reach: error rates are falling steadily by approximately 20% each year with no end in sight.
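
To put the 20% figure in perspective, here is a minimal sketch of how a constant relative reduction compounds over time. The 10% starting error rate is a hypothetical value chosen for illustration, not a figure from this article, and the assumption that the yearly reduction stays constant is a simplification.

```python
import math


def projected_error_rate(initial_rate, yearly_reduction, years):
    """Error rate after `years` of a constant relative yearly reduction."""
    return initial_rate * (1.0 - yearly_reduction) ** years


def years_to_halve(yearly_reduction):
    """Years needed for the error rate to fall to half its starting value."""
    return math.log(0.5) / math.log(1.0 - yearly_reduction)


# Hypothetical starting point: a 10% word error rate (an assumption).
for year in (0, 3, 6, 9):
    print(f"year {year}: {projected_error_rate(0.10, 0.20, year):.2%}")

print(f"halving time: {years_to_halve(0.20):.1f} years")
```

Under these assumptions the error rate halves roughly every three years, since ln(0.5) / ln(0.8) is about 3.1.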

Improvements stem from a variety of sources, including the introduction of more sophisticated noise processing, the use of multi-microphone arrays capable of forming directional beams that focus in on the user, and the application of voice biometrics to tell the intended user apart from interfering voices. Auditory scene analysis is another promising technique that attempts to tease apart various sound sources by exploiting differences in their time of arrival, and tries to mimic our ability to focus on one relevant signal among many (the “cocktail party effect”).
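
The idea of exploiting differences in time of arrival can be illustrated with delay-and-sum beamforming, one simple and well-known array technique. This article does not say which method any particular product uses, so treat the choice as an assumption. The toy signal, the two-microphone setup, and the one-sample delay below are all hypothetical.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its delay (in samples), then average.

    Sound from the steered direction lines up across channels and adds
    coherently; sound arriving with other delays adds out of step.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


# Toy target signal reaching microphone B one sample later than microphone A.
target = [0, 1, 0, -1, 0, 1, 0, -1]
mic_a = target + [0]
mic_b = [0] + target

aligned = delay_and_sum([mic_a, mic_b], [0, 1])    # steered at the talker
unaligned = delay_and_sum([mic_a, mic_b], [0, 0])  # no steering

print("aligned peak:  ", max(abs(x) for x in aligned))
print("unaligned peak:", max(abs(x) for x in unaligned))
```

With the correct delays the target is recovered at full amplitude, while the unsteered sum is attenuated. Practical systems work with fractional delays, many microphones, and noisy signals.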


Learning like a machine

Voice recognition accuracy also continues to benefit from the ever larger amounts of data used to train statistical models using ever more powerful machine learning techniques. Of note on this front are so-called Deep Neural Networks (DNNs). DNNs are pattern matchers that consist of multiple interconnected layers of simple processing units inspired by the neural networks in the brain. DNNs are able to take a broad variety of inputs (speech utterances, images, sequences of recognized words, location and speed data) and classify these into desired categories (words, objects, meaning representations, or adjustments to a vehicle’s controls).

Like other machine learning techniques, DNNs rely on ‘learning from examples’: the networks are presented with labeled training exemplars and learn the associations between the input features and the desired classification. In the purest implementation, these classifiers relieve us from having to formulate theories of how language works and from having to encode rules reflecting our understanding. This convenience comes at the cost of opacity of the decision processes of these ‘black boxes.’
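
As a rough illustration of ‘learning from examples,’ the sketch below trains a single artificial unit (a perceptron) on four labeled exemplars. This is far simpler than a DNN, which stacks many layers of such units and is trained differently; the function names and the toy task (the logical AND of two inputs) are assumptions made for illustration.

```python
def train_perceptron(examples, epochs=20):
    """Learn weights from (features, label) pairs with the perceptron rule."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            # Nudge the weights toward the desired classification.
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    return weights, bias


def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs exceeds zero."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Labeled training exemplars: the output should be 1 only when both inputs are 1.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
for features, label in AND_EXAMPLES:
    print(features, "->", predict(weights, bias, features), "expected", label)
```

No rule for AND is written down anywhere in the code; the behavior is captured entirely by the learned numbers, which hints at why larger networks are hard to inspect.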

Early attempts at applying connectionist techniques to pattern matching tasks often got “stuck” and produced suboptimal results. The latest form of ‘deep learning’ is able to discover more globally optimal solutions, and so is able to model complex decision boundaries. The introduction of DNNs has boosted the performance of both the acoustic and language models in speech recognition. DNNs are also being successfully applied to the problem of assigning meaning to the recognized words, by transforming word sequences into vector representations which are more amenable to pattern recognition — again using DNNs.


Intelligence by observation

The recent successful application of DNNs to a broad range of challenging perceptual tasks — and the fact that their topology mimics brain structures — has led to the suggestion that deep learning is the answer to Artificial Intelligence (AI). This view is in part a reaction to the difficulties encountered with early attempts to design symbolic or logic-based models operating on explicit rules and reasoning algorithms crafted and adapted by their designers. Historically, many such systems proved fragile and did not scale.

Will a future Samantha be based entirely on neural nets? DNNs will undoubtedly play a critical role, particularly in helping her perceive the external world. However, this approach alone will likely be insufficient.

The type of reasoning we perform when facing even simple everyday situations has as much to do with manipulating mental concepts, performing logical deductions, and even executing algorithms, as it does with matching patterns. Consider again the restaurant reservation example. If the attempt were unsuccessful, we’d expect a useful assistant to propose some reasonable alternatives:

“Sorry, nothing’s available until 9pm. Would you like another Italian restaurant in the area at about 6:30pm?”

To be able to do this, the assistant needs to consider a number of alternative solutions and evaluate their merit, informed by an understanding of the relative importance of the various constraints: our past preferences, time, location, restaurant type, and availability of all the attendees.
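
One simple way to picture this kind of evaluation is a weighted score over the constraints. The sketch below is an assumption-laden toy, not a description of any real assistant: the weights, the scoring formulas, and every option other than Zingari are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    restaurant: str
    cuisine: str
    time_minutes: int        # minutes after midnight
    distance_km: float       # distance from the originally requested venue
    all_attendees_free: bool


# Hypothetical relative importance of each constraint.
WEIGHTS = {"cuisine": 2.0, "time": 3.0, "distance": 1.0, "attendees": 4.0}


def score(option, wanted_cuisine, wanted_time_minutes):
    """Weighted sum of how well an option satisfies each constraint (0 to 1)."""
    cuisine_fit = 1.0 if option.cuisine == wanted_cuisine else 0.0
    time_gap = abs(option.time_minutes - wanted_time_minutes)
    time_fit = max(0.0, 1.0 - time_gap / 180)
    distance_fit = max(0.0, 1.0 - option.distance_km / 5.0)
    attendee_fit = 1.0 if option.all_attendees_free else 0.0
    return (WEIGHTS["cuisine"] * cuisine_fit
            + WEIGHTS["time"] * time_fit
            + WEIGHTS["distance"] * distance_fit
            + WEIGHTS["attendees"] * attendee_fit)


options = [
    Option("Zingari", "Italian", 21 * 60, 0.0, False),               # only 9pm is free
    Option("Example Trattoria", "Italian", 18 * 60 + 30, 0.8, True),
    Option("Example Sushi Bar", "Japanese", 18 * 60, 0.3, True),
]

best = max(options, key=lambda o: score(o, "Italian", 18 * 60))
print("Suggest:", best.restaurant)
```

With these made-up weights, the nearby Italian restaurant at 6:30pm outranks both the late table at Zingari and the closer non-Italian option, mirroring the suggestion quoted above.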

In principle, a neural network pattern matcher could over time learn the associations of each possible contingency and the desired next step from trial and error — observations and user feedback. However, this is a very sparse space: for every realistic interaction type there are a myriad of possible contexts and variations, and straightforward reinforcement learning that attempted to capture mappings between each specific mix of external factors and a desired outcome would converge exceedingly slowly.

Our own learning is life-long. We get instruction along the way via constant, rich, and customized feedback, including explanation and learning of rules, and a reasoned adjustment of high-level goals. We have an impressive power of abstraction, and are able to generalize from single examples in ways that current machine learning is not able to do. We can learn quickly because over the years we have erected a robust abstract foundation.

It is possible that we will learn how to build an entirely neural net-based intelligence that can operate at a much higher conceptual level than today; after all, our own logical reasoning is implemented on a neural “connectionist” substrate. But we shouldn’t make the mistake of thinking that a specific computational architecture limits what software can run on it: our brain “hardware” is as adept at pattern matching as it is at the manipulation of symbols. And we shouldn’t limit ourselves to approaches that ‘seem’ closer to the way in which we imagine human minds operate; our understanding in this regard is rudimentary at this point. It is reasonable to make use of whatever computational tools we have to solve the problem at hand.


Intelligence by instruction

So if not a pure machine learning approach, then what?

Substantial advances have been made in the areas of symbolic reasoning and explicit knowledge representation since the early days of fragile, rule-based AI systems. Using improved parsers, we more reliably extract linguistic structure; more sophisticated knowledge frameworks efficiently describe key concepts, their attributes and relationships. Probabilistic reasoning with inconsistent information handles ambiguity and is able to produce solutions in situations where an exact answer might not be possible, and where earlier reasoning systems might have fallen apart. Using these approaches, we can encode pre-existing knowledge in a more robust fashion, and explicitly design important behavioral aspects of an intelligent system.

Where early systems attempted to solve the whole “AI problem” using rules alone, modern AI combines symbolic processing and machine learning in a way that exploits their respective strengths and minimizes their weaknesses. The former allows us to efficiently specify robust knowledge and behaviors that may be difficult or slow to learn from data. (Why should the system learn how to multiply by slowly generalizing from observed examples, when we can just write down the algorithm?) Machine learning, working in a complementary fashion, helps the system adapt to unanticipated situations, like learning new spoken forms of existing concepts, discovering new concepts, or adjusting dialog strategies based on user feedback.
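
The division of labor can be sketched in a few lines. In this toy, the arithmetic is written down directly (the symbolic part), while the mapping from spoken forms to concepts grows through feedback. The class and method names are hypothetical, and the ‘learning’ is a plain lookup update standing in for a statistical model.

```python
import operator

# Symbolic part: knowledge we can simply write down.
OPERATIONS = {"multiply": operator.mul, "add": operator.add}


class HybridCalculator:
    def __init__(self):
        # Adaptive part: spoken forms of known concepts, extended by feedback.
        self.spoken_forms = {"times": "multiply", "plus": "add"}

    def learn_spoken_form(self, phrase, concept):
        """Record that a new phrase refers to an existing concept."""
        if concept not in OPERATIONS:
            raise ValueError(f"unknown concept: {concept}")
        self.spoken_forms[phrase] = concept

    def answer(self, left, phrase, right):
        """Apply the written-down algorithm for whatever the phrase means."""
        concept = self.spoken_forms.get(phrase)
        if concept is None:
            return None  # unfamiliar phrasing; a real system might ask the user
        return OPERATIONS[concept](left, right)


calc = HybridCalculator()
print(calc.answer(6, "times", 7))
print(calc.answer(6, "multiplied by", 7))    # not understood yet
calc.learn_spoken_form("multiplied by", "multiply")
print(calc.answer(6, "multiplied by", 7))
```

Multiplication itself never has to be learned from examples; only the surface language around it adapts.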


Setting goals — but whose goals?

A hybrid machine-learning/symbolic system has useful properties beyond speeding up the learning process. Maybe the most important is our ability to explicitly understand — and control — its behavior.

Samantha did not make very many basic errors along the way; you could say that she only suffered a single big failure — the ultimate abandonment of her user. Clearly, her own and Theodore’s aims diverged along the way, underscoring the difficulty of anticipating the consequences of setting complex goals and controlling emergent behavior.

Presumably, once we begin approaching a Samantha-like level of performance, we’ll want to program in high-level goals and ethics in a way that ensures the system would not abandon humans for more interesting relationships with other operating systems. We would design in assistance for life. (That said, there are worse fates than abandonment when goals diverge. Skynet! Cylons!)

Who will ultimately determine our assistants’ goals? As they gain intelligence, will our assistants aim to strictly attain the goals their users set, or some blend of user-settable instructions and pre-programmed attitudes? Will our assistants fulfill our request for directions to a fast food restaurant or subtly steer us towards a healthier choice?

We will need our intelligent systems to be capable of introspection, and to answer our questions about their decisions: “why did you suggest this instead of that?” This is unlikely to be possible with ‘black box’ neural net approaches, but doable (though not easily) in the case of hybrid systems that include symbolic processing. Symbolic approaches are more amenable to explanation — something that we humans demand and expect of our friends. How would a DNN explain what it did? In terms of which nodes fired?

Even today’s relatively simple assistant apps are beginning to synthesize direct answers to our queries, as opposed to simply navigating to information sources that we know and trust. It will be important for the next generation of assistants to be able to tell us where they are getting the information from, and explain their line of thought in providing answers. In the long term, as we look to our AIs to help us manage a seemingly infinite information flow, we will need them to become critical consumers of information on our behalf, telling us “this is the prevailing opinion on this particular issue, but here’s an important alternative view.”


Emotional Intelligence

One of the most compelling aspects of Samantha is that she behaves in an utterly human-like manner, with a true sense of what is humorous and sad. This is yet a higher level of reasoning, and huge challenges remain to truly understand, and program, social relationships, emotional ties, and humor, which are all parts of everyday knowledge. It is more conceivable that we will be able to make a system understand why a person feels sad or happy (in the most primitive terms, perhaps because of realization of goal failure or goal success) than that we will actually simulate or replicate visceral feelings in machines.

Is it necessary to make intelligent systems human-like?

Much of human behavior is motivated by emotions and not by black-and-white logical arguments (search through any popular online news blog for evidence!). The machine thus needs to understand to some degree why a human is doing something or wants something done, just as much as we demand an explanation from them about their own behavior. There is also a very practical reason to want this: in order to interact effectively we need a model of the “other,” whether it’s an app or a person. At a high level of sophistication it will be faster and more efficient to allow us to start from such models we have of humans, as opposed to slowly discovering the parameters of a wholly alien and new “AI tool.”

There is also that astonishing voice… Samantha had us at that first playful and breathy “Hi.”

The amazing emotional range and subtle modulation of Samantha’s voice are beyond what today’s speech synthesis can produce, but this technology is on a trajectory to cross the ‘uncanny valley’ (the awkward zone of ‘close but not quite human’ performance) in the next few years. New speech generation models, driven in part by machine learning as well as by explicit knowledge of the meaning of the text, will be able to produce artificial voices with impressively natural characteristics and absence of artifacts.


Real intelligence — or simulated?

A natural and compelling voice greatly bolsters the appearance of intelligence. The downside is that apparent but not real intelligence leads to over-expectation on the part of users, which in turn leads to conversational failures. Our focus today is thus on building conversational systems that can recover from such incorrect inferences on the part of the user, and move the dialog forward in a graceful manner.

Whether the intelligence is real or simulated, we appear predisposed to give systems exhibiting such behavior — and such voices — the benefit of the doubt. In fact, our tendency to be taken in by superficially human-like behavior is leading to the re-examination of the Turing test as an effective measure of machine intelligence. The Turing test requires that a human judge engage in a conversation with a human and a machine, and if it’s not possible to reliably tell the machine from the human, the machine is said to have passed the test. The systems that have done well on this test in the recent past have made extensive use of deception and evasion to confound the human judges, relying on simulated “human-like” behavior, including wordplay, puns, jokes, quotations, clever asides, and even emotional outbursts, but with no real comprehension.

This situation has led to the proposal of an alternative, called the Winograd Schema Challenge, as a gauge of reasoning ability. The challenge poses a set of multiple-choice questions that are in a particular form and that must be answered correctly by the program. For example: “The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer 0: the trophy; Answer 1: the suitcase.” Answering such questions requires a model of the world and an ability to reason with it.
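
A question of this form is easy to represent and score, which is part of the challenge’s appeal. The sketch below is only an illustration of the format: the class, the scoring function, and the deliberately naive resolver are hypothetical, and the second question is the usual companion to the trophy example, in which one word is changed and the answer flips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradQuestion:
    sentence: str
    question: str
    candidates: tuple
    correct_index: int


def accuracy(questions, resolver):
    """Fraction of questions for which the resolver picks the right candidate."""
    correct = sum(1 for q in questions if resolver(q) == q.correct_index)
    return correct / len(questions)


QUESTIONS = [
    WinogradQuestion(
        "The trophy would not fit in the brown suitcase because it was too big.",
        "What was too big?",
        ("the trophy", "the suitcase"),
        0,
    ),
    WinogradQuestion(
        "The trophy would not fit in the brown suitcase because it was too small.",
        "What was too small?",
        ("the trophy", "the suitcase"),
        1,
    ),
]


def always_first(question):
    """Naive resolver that ignores meaning and picks the first candidate."""
    return 0


print(f"naive accuracy: {accuracy(QUESTIONS, always_first):.0%}")
```

Because the paired sentences differ by a single word, a resolver that relies on surface patterns rather than an understanding of sizes and containers does no better than chance on the pair.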


Senses and sensibility

In contrast with her extraordinary cognitive abilities, some aspects of Samantha are surprisingly limited, and even anachronistic by today’s standards.

One of these is the manner in which she experiences the physical world. Samantha interacts with Theodore when he invokes her through a button press, and sees the world when he shows it to her through his phone’s camera.

Today’s assistants make use of a variety of signal feeds to optimize their actions, since these all carry meaning: touch, gesture, audio, video, location, and motion. The systems use such sensory feeds to identify and locate users, but also to better understand their activity and to adjust their own behavior to the circumstances. Today’s assistants no longer need to be woken up with a button press, instead responding simply by being spoken to, listening continuously for cues within the right context. We can anticipate giving our future “always on” and “always aware” assistants permission to listen over extended periods — to our conversations, meetings — so as to have a running understanding of what concerns us and thus how to best help us. (“Summarize the relevant parts of this meeting, please.”)

It is hard to imagine how Samantha could be as empathetic and socially sensitive as she seems to be without a stronger connection to the physical world. It is difficult to see how she could even manage without access to a rich set of sensors to provide a broader context for her interactions.

The idea of associating an intelligent assistant with a single device is also already a thing of the past; instead, our assistants are already manifesting themselves through a variety of hardware that we utilize throughout the day. The same persona, with a consistent and constantly updated knowledge of our needs and interests, and a memory of all our interactions, will transcend specific hardware and be with us as we move between using our wearables, tablets, TVs, and cars.

While Samantha revealed that she interacted with other users, her relationship with Theodore was utterly isolated, seemingly taking place in a technologically barren world. In the not too distant future of “ambient intelligence,” our virtual assistants will interact on our behalf with a myriad of other systems, which themselves may possess intelligence of varying degrees, to help us with our banking, travel, etc.

And they will certainly interact with our friends’ assistants, if only to help set up that dinner outing.


An amplification of intelligence

Towards the end of her evolution, Samantha becomes interested in the work of philosopher Alan Watts, highlighting her ability to discover and digest information irrespective of subject matter and form.

In stark contrast, today’s intelligent systems have only a very limited ability to handle unstructured information: question-answering systems either require extensive curation and pre-structuring of information sources, or else piece together answers based on a superficial processing of source text in a way that does not support deep reasoning.

From the perspective of today’s intelligent systems, unstructured information is the ‘dark matter’ of the Web: tantalizing but inaccessible. Being able to automatically and accurately understand the concepts and relationships in unstructured text, and then to be able to reason about them, would represent a vast leap in the capabilities of virtual assistants. Research programs such as CMU’s NELL (Never-Ending Language Learner), which reads the Web and automatically populates a knowledge base, show promise. But the overall goal (to automatically read, comprehend, and reason) remains elusive.

Samantha, of course, has no problems with this, and an intelligent assistant with this capability would have a profound impact on every aspect of our lives. She would be an expert who could meaningfully help with the complexities of our lives: finance, learning (imagine your own personalized Khan Academy with instruction tailored to your needs and abilities!), and health (“what’s the best healthcare insurance option for me?”). Such technology would permeate society, and raises the question “What would life be like, if through the use of their assistants, everyone’s effective IQ jumped by 50 points?”

Ultimately, the real promise of AI — at least as we see it — is not the creation of artificial companions, but an Amplification of Intelligence (ours) through the creation of amazing and transformative tools.

This post originally appeared on the InnovationInsights blog by Wired.

  • Howard Treesong

    ‘Her’ is a bit of a let-down. There’s not a single new idea in it. We have simply forgotten that it was all done a generation ago. Asimov’s robot series, including The Foundation; Laumer’s Bolo series; Frederik Pohl’s Heechee series starting with Gateway, a book that sparked my interest in AI and speech recognition. Even the abandoning of the owner is not a new idea.

    I’m pretty sure there’s going to be people who are going to be bowled over by these ideas. It just means they haven’t read enough.

    That being said: I very much look forward to when we make this technology a practical reality because it will be a significant force multiplier in our daily routine. If this is ever a product, I’m buying it.

About Vlad Sejnoha

As Nuance's Chief Technology Officer, Vlad Sejnoha oversees Nuance's research and focuses on core technology and product strategy, with an emphasis on emerging areas including natural language processing and mobile applications. Prior to joining Nuance, Vlad was Chief Scientist at L&H, and earlier at Kurzweil AI, where he was responsible for creating technology for a number of commercially successful speech recognition products, including large vocabulary continuous speech dictation systems. Vlad has over 20 years of experience in the field of speech recognition and holds thirteen US patents. Vlad is originally from the Czech Republic and later lived in Montreal and Mexico City for several years. He has since put down roots in the Boston area, where he currently lives with his wife and son. In his spare time, Vlad is an avid cyclist and diver, often traveling to exotic locations such as French Polynesia.