In 2014, Nuance, along with Commonsensereasoning.org, announced plans for a Winograd Schema Challenge – a call for students, researchers, and academics to design programs that leverage AI to demonstrate reasoning capabilities by answering Winograd Schema questions.
The Winograd Schema Challenge is an alternative to the Turing Test (short free-form questions) intended to provide a more accurate measure of genuine machine intelligence. It poses a set of multiple-choice questions whose answers are expected to be fairly obvious to a layperson, but ambiguous for a machine without human-like reasoning or intelligence.
Yesterday, the results of the first annual Winograd Schema Challenge were unveiled at the International Joint Conference on Artificial Intelligence (IJCAI-2016) in New York. Six programs were submitted by independent researchers and students around the world, demonstrating a variety of approaches to solving the challenge questions, while underscoring the difficulty machines have in using commonsense reasoning to understand human language.
Scores ranged from the low 30s to the high 50s in the percentage of questions answered correctly – demonstrating that while some of the Winograd Schema questions could be handled, much more research is needed to develop systems that can handle these sorts of tests. For comparison, human subjects were asked the same set of questions and answered an overall average of 90.9% correctly.
* A problem was discovered at the last minute with unexpected punctuation in the XML input affecting a handful of questions. All tests will be run again, but this is unlikely to have a significant impact on the scores.
The challenge involves two rounds of testing: the first round involves Pronoun Disambiguation Problems (PDPs) that are similar to, but slightly different from, Winograd Schemas (WSs). A contestant must score very well on the PDPs to move on to the next round of WSs.
Professor of computer science Ernest Davis of NYU has created a large library of WSs. Winograd Schema questions are manually generated; examples include:
“The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer 0: the trophy or Answer 1: the suitcase?”
“Joan made sure to thank Susan for all the help she had given. Who had given the help? Answer 0: Joan or Answer 1: Susan”
PDPs were collected from children’s literature and also require commonsense reasoning to understand the relationships between objects and events so that the proper referent of a pronoun can be determined. Further, PDPs are abundant in our everyday language and occur organically. An example of a PDP is:
Babar wonders how he can get new clothing. Luckily, a very rich old man who has always been fond of little elephants understands right away that he is longing for a fine suit. As he likes to make people happy, he gives him his wallet. “he is longing for a fine suit”: (a) Babar (b) old man Answer: (a) Babar
While simple for humans, AI computer systems today lack sufficient commonsense knowledge and reasoning to solve these questions. Each Schema and PDP involves a variety of different types of relationships, such as cause-and-effect, spatial, temporal and social relationships.
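The question format described above – a sentence with an ambiguous pronoun, a question, and two candidate referents – can be sketched as a simple data structure, along with the percent-correct scoring used to report results. This is a hypothetical illustration for clarity, not the challenge's actual test format or evaluation code:

```python
from dataclasses import dataclass

@dataclass
class SchemaQuestion:
    """One Winograd Schema-style multiple-choice question (illustrative only)."""
    sentence: str              # sentence containing the ambiguous pronoun
    question: str              # asks what the pronoun refers to
    candidates: tuple          # the two possible referents
    answer_index: int          # index of the correct referent

# The trophy/suitcase example quoted earlier in the article
trophy = SchemaQuestion(
    sentence="The trophy would not fit in the brown suitcase because it was too big.",
    question="What was too big?",
    candidates=("the trophy", "the suitcase"),
    answer_index=0,
)

def percent_correct(questions, predictions):
    """Score a system's answers as the fraction answered correctly."""
    correct = sum(p == q.answer_index for q, p in zip(questions, predictions))
    return 100.0 * correct / len(questions)

# A system that picks "the trophy" gets this one question right
print(percent_correct([trophy], [0]))  # → 100.0
```

On this representation, the 90.9% human baseline and the 30s-to-50s machine scores reported above are simply `percent_correct` over the full question set.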
So what does this all mean for the state of AI and machine reasoning?
The Challenge underscored the challenges of understanding language and reasoning about the world that AI still faces. However, the Challenge also provided a baseline for subsequent systems and testing. Participants made use of a variety of technologies such as natural language parsing, knowledge acquisition and deep learning.
So now, we look to 2018, when the next Winograd Schema Challenge will be judged at the 2018 AAAI conference – and given the rate of innovation, it could deliver results that take the state of the art in human-machine interaction even further.