Corpus ID: 17263631

Establishing a Human Baseline for the Winograd Schema Challenge

David Bender
The Winograd Schema Challenge (WSC) is a pronoun resolution task for which deep semantic knowledge is required to achieve high performance. Until now it has been assumed that human performance on the WSC is nearly at ceiling, but evidence for this has been mainly anecdotal. Here we present the results of a large online experiment that both establishes a baseline for human performance on the WSC and demonstrates the importance of human testing, not only as a means of validating a particular… 


The Defeat of the Winograd Schema Challenge
The history of the Winograd Schema Challenge is reviewed, including how a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy.
A Data-Driven Metric of Hardness for WSC Sentences
A large-scale experiment shows how the performance of a particular automated approach varies with the availability of training material in the Winograd Schema Challenge, and finds that the performance of the automated approach correlates positively with the performance of humans, suggesting that it could be used as a metric of hardness for WSC instances.
Winograd Schemas in Portuguese
A language model for Portuguese is created based on a set of Wikipedia documents to stimulate the development of Natural Language Processing in Portuguese in the Winograd Schema Challenge.
WinoReg: A New Faster and More Accurate Metric of Hardness for Winograd Schemas
WinoReg is presented, a new system that computes the hardness of Winograd Schemas by training a Random Forest classifier over a rich set of features identified in relevant WSC works in the literature, and which is considerably faster and more accurate than the system proposed in earlier work.
WinoLogic: A Zero-Shot Logic-based Diagnostic Dataset for Winograd Schema Challenge
A logic-based framework that focuses on high-quality commonsense knowledge: it identifies and collects formal knowledge formulas verified by theorem provers, translates such formulas into natural language sentences, and proposes a new dataset named WinoLogic built from these sentences.
An approach to the Winograd Schema Challenge based on semantic classification of events and adjectives
The objective of this research is to introduce a commonsense based logical method that can achieve competitive accuracy with the statistical methods on a specific form of the Winograd Schema problems.
Mandarinograd: A Chinese Collection of Winograd Schemas
This article introduces Mandarinograd, a corpus of Winograd Schemas in Mandarin Chinese. Winograd Schemas are particularly challenging anaphora resolution problems, designed to involve common sense
A Knowledge Hunting Framework for Common Sense Reasoning
An automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge, using a knowledge hunting module to gather text from the web to serve as evidence for candidate problem resolutions.
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
A novel task and dataset, called Winoground, for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning; it is intended to serve as a useful evaluation set for advancing the state of the art and driving further progress in the industry.
Investigating associative, switchable and negatable Winograd items on renewed French data sets
The update of the existing French data set and the creation of three subsets allowing for a more robust, fine-grained evaluation protocol of WSC in French, showing in addition that the higher performance could be explained by the existence of associative items in FWSC.


Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
A knowledge-rich approach to the task of resolving complex cases of definite pronouns is employed, which yields a pronoun resolver that outperforms state-of-the-art resolvers by nearly 18 points in accuracy on the authors' dataset.
The Winograd Schema Challenge
This paper presents an alternative to the Turing Test that has some conceptual and practical advantages: English-speaking adults will have no difficulty with it, and the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person.
The PASCAL Recognising Textual Entailment Challenge
This paper presents the Third PASCAL Recognising Textual Entailment Challenge (RTE-3), providing an overview of the dataset creating methodology and the submitted systems. In creating this year's
SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning
The two systems that competed in this task as part of SemEval-2012 are described, and their results are compared to those achieved in previously published research.
Understanding natural language
A computer system for understanding English that contains a parser, a recognition grammar of English, programs for semantic analysis, and a general problem solving system based on the belief that in modeling language understanding, it must deal in an integrated way with all of the aspects of language—syntax, semantics, and inference.
Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research
This paper replicates a diverse body of tasks from experimental psychology, including the Stroop, Switching, Flanker, Simon, Posner Cuing, attentional blink, subliminal priming, and category learning tasks, with participants recruited through AMT.
Amazon's Mechanical Turk
Findings indicate that MTurk can be used to obtain high-quality data inexpensively and rapidly and the data obtained are at least as reliable as those obtained via traditional methods.
Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers
It is shown that crowdsourced workers are likely to participate across multiple related experiments and that researchers are overzealous in the exclusion of research participants, which can be avoided using advanced interface features that also allow prescreening and longitudinal data collection.
Financial incentives and the "performance of crowds"
It is found that increased financial incentives increase the quantity, but not the quality, of work performed by participants, where the difference appears to be due to an "anchoring" effect.