Winograd Schema Challenge: Can computers reason like humans?

Contestants for the Winograd Schema Challenge build intelligent systems to test natural language and reasoning capabilities.

In 2014, Nuance, along with Commonsensereasoning.org, announced plans for a Winograd Schema Challenge – a call for students, researchers, and academics to design programs that leverage AI to demonstrate reasoning capabilities by answering Winograd Schema questions.

The Winograd Schema Challenge is an alternative to the Turing Test (which relies on short free-form questions), designed to provide a more accurate measure of genuine machine intelligence. It poses a set of multiple-choice questions whose answers are expected to be fairly obvious to a layperson, but ambiguous for a machine without human-like reasoning or intelligence.

Yesterday, the results of the first annual Winograd Schema Challenge were unveiled at the International Joint Conference on Artificial Intelligence (IJCAI-2016) in New York. Six programs were submitted by independent researchers and students from around the world, demonstrating a variety of approaches to solving the challenge questions, while underscoring the difficulty machines still have in using commonsense reasoning to understand human language.

Scores ranged from the low 30s to the high 50s in the percentage of questions answered correctly – demonstrating that while some of the Winograd Schema questions could be handled, much more research is needed to develop systems that can handle these sorts of tests. For comparison, human subjects were asked the same set of questions and answered an overall average of 90.9% correctly.

[Image: table of Winograd Schema Challenge participant results]

* A problem was discovered at the last minute with unexpected punctuation in the XML input affecting a handful of questions. All tests will be run again, but this is unlikely to have a significant impact on the scores.


The challenge involves two rounds of testing: the first round consists of Pronoun Disambiguation Problems (PDPs), which are similar to, but slightly different from, Winograd Schemas (WSs). A contestant must score very well on the PDPs to move on to the next round of WSs.

Ernest Davis, a professor of computer science at NYU, has created a large library of WSs. Winograd Schema questions are manually generated; examples include:

“The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer 0: the trophy or Answer 1: the suitcase?”

“Joan made sure to thank Susan for all the help she had given. Who had given the help? Answer 0: Joan or Answer 1: Susan”

PDPs were collected from children’s literature and also require commonsense reasoning to understand the relationships between objects and events so that the proper referent of a pronoun can be determined. Further, PDPs are abundant in our everyday language and occur organically. An example of a PDP is:

Babar wonders how he can get new clothing. Luckily, a very rich old man who has always been fond of little elephants understands right away that he is longing for a fine suit. As he likes to make people happy, he gives him his wallet. “he is longing for a fine suit”: (a) Babar (b) old man Answer: (a) Babar  
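
To make the format concrete, here is a minimal sketch in Python – using hypothetical names, not the official challenge format – of how such two-choice pronoun questions might be represented and how a program's accuracy could be scored against them:

```python
# A minimal sketch of two-choice pronoun questions and accuracy scoring.
# The class and field names here are illustrative assumptions, not the
# official Winograd Schema Challenge data format.
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PronounQuestion:
    sentence: str          # passage containing the ambiguous pronoun
    pronoun: str           # the pronoun to be resolved
    candidates: List[str]  # the two possible referents, in answer order
    correct: int           # index (0 or 1) of the correct referent

QUESTIONS = [
    PronounQuestion(
        "The trophy would not fit in the brown suitcase because it was too big.",
        "it", ["the trophy", "the suitcase"], correct=0),
    PronounQuestion(
        "Joan made sure to thank Susan for all the help she had given.",
        "she", ["Joan", "Susan"], correct=1),
]

def accuracy(resolve: Callable[[PronounQuestion], int],
             questions: List[PronounQuestion]) -> float:
    """Fraction of questions on which the resolver picks the right referent."""
    hits = sum(1 for q in questions if resolve(q) == q.correct)
    return hits / len(questions)

if __name__ == "__main__":
    # Coin-flip baseline: with two candidates, random guessing averages ~50%.
    guess = lambda q: random.randrange(len(q.candidates))
    print(f"random-baseline accuracy: {accuracy(guess, QUESTIONS):.0%}")
```

Note that with only two candidate referents per question, a coin-flip baseline already scores around 50%, which puts the low-30s-to-high-50s range reported above in perspective.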

While simple for humans, AI computer systems today lack sufficient commonsense knowledge and reasoning to solve these questions. Each Schema and PDP involves a variety of different types of relationships, such as cause-and-effect, spatial, temporal and social relationships.

So what does this all mean for the state of AI and machine reasoning?

The Challenge underscored the difficulty of understanding language and reasoning about the world that AI still faces. However, the Challenge also provided a baseline for subsequent systems and testing. Participants made use of a variety of technologies, such as natural language parsing, knowledge acquisition, and deep learning.

So now we look to 2018, when the next Winograd Schema Challenge will be judged at the 2018 AAAI conference – and which, given the rate of innovation in AI, could potentially deliver results that take the state of the art in human-machine interaction even further.


  • loebner

    Winograd questions are not incompatible with a Turing Test. The judge in a TT can always choose to ask Winograd questions. Two of the pre-selection questions for the 2016 Loebner Prize were Winograd questions.

    • Charlie Ortiz

      That’s true. But the “rules” of the Turing test and the way it is graded (completely subjectively) do not prevent the bot from trying to evade the question and answer something like, “You’ve got to be kidding? You don’t know the answer to that?” The Turing test is too susceptible to trickery as it stands. As I like to say, it is a good test for determining whether someone has a future in politics 🙂

      • loebner

        A human can answer the question; a bot can't. Any competent judge can tell if the bot is trying to evade the issue. The TT depends on both judge and human making a good faith effort.


About Charles Ortiz

Charles Ortiz is Director of the Artificial Intelligence and Reasoning Group at the Nuance Natural Language and AI Laboratory in Sunnyvale, CA. Prior to joining Nuance, he was the director of research in collaborative multi-agent systems at the AI Center at SRI International. His research interests and contributions are in multiagent systems (collaborative dialogue-structured assistants, collaborative work environments, negotiation protocols, and logic-based BDI theories), knowledge representation and reasoning (causation, counterfactuals, and commonsense reasoning), and robotics (cognitive robotics, team-based robotics, and dialogue-based human-robot interaction). He has approximately 20 years of technical leadership and management experience in leading major projects and setting strategic directions. He has collaborated extensively with faculty and students at many academic institutions including Harvard University, Bar-Ilan University, UC Berkeley, Columbia University, University of Southern California, Vassar College, and Carnegie Mellon University. He holds a S.B. in Physics from MIT, an M.S. in Computer Science from Columbia University, and a Ph.D. in Computer and Information Science from the University of Pennsylvania. Following his Ph.D. research, he was a Postdoctoral Research Fellow at Harvard University. He has taught courses at Harvard and at UC Berkeley (as an Adjunct Professor) and has also presented tutorials at technical conferences (IJCAI 1999 and 2005, AAAI 2002 and 2004, AAMAS 2002-2004).