WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

@article{Sakaguchi2020WINOGRANDEAA,
  title={WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale},
  author={Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
  journal={ArXiv},
  year={2020},
  volume={abs/1907.10641}
}
The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011) as an alternative to the Turing Test, was originally designed as a pronoun resolution problem that cannot be solved based on statistical patterns in large text corpora. [...] Key Method: Key to our approach is a novel adversarial filtering algorithm, AFLITE, for systematic bias reduction, combined with a careful crowdsourcing design. Despite the significant increase in training data, the performance of existing state-of-the-art methods remains…
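The filtering idea can be illustrated with a minimal sketch. This is not the authors' AFLITE implementation; it is a toy version of the general recipe, assuming a simple nearest-centroid "weak learner" in place of the linear classifiers trained on pretrained embeddings: repeatedly estimate how predictable each instance is for weak models trained on random partitions, then discard the most predictable ones.

```python
import random

def aflite_sketch(instances, labels, n_rounds=5, n_partitions=10,
                  cutoff=0.75, n_remove=2, seed=0):
    """Toy adversarial filtering in the spirit of AFLITE.

    instances: list of feature vectors (lists of floats)
    labels:    list of 0/1 labels
    Returns the indices of instances that survive filtering.
    """
    rng = random.Random(seed)
    keep = list(range(len(instances)))

    def train_centroid(train_idx):
        # "Weak learner": nearest class centroid over the features.
        sums, counts = {0: None, 1: None}, {0: 0, 1: 0}
        for i in train_idx:
            y = labels[i]
            if sums[y] is None:
                sums[y] = list(instances[i])
            else:
                sums[y] = [a + b for a, b in zip(sums[y], instances[i])]
            counts[y] += 1
        return {y: [v / counts[y] for v in sums[y]]
                for y in (0, 1) if counts[y]}

    def predict(cents, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(c, x))
        return min(cents, key=lambda y: dist(cents[y])) if cents else 0

    for _ in range(n_rounds):
        correct = {i: 0 for i in keep}
        seen = {i: 0 for i in keep}
        for _ in range(n_partitions):
            shuffled = keep[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            train_idx, held_out = shuffled[:half], shuffled[half:]
            cents = train_centroid(train_idx)
            for i in held_out:
                seen[i] += 1
                if predict(cents, instances[i]) == labels[i]:
                    correct[i] += 1
        # Predictability score: fraction of held-out evaluations correct.
        scores = {i: correct[i] / seen[i] for i in keep if seen[i]}
        easy = sorted((i for i in scores if scores[i] >= cutoff),
                      key=lambda i: -scores[i])[:n_remove]
        if not easy:
            break
        keep = [i for i in keep if i not in easy]
    return keep
```

On a trivially separable dataset, most instances are highly predictable and get filtered out; the actual AFLITE operates the same way but over embedding features with an ensemble of linear classifiers.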
Precise Task Formalization Matters in Winograd Schema Evaluations
TLDR
This work performs an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and finds that framing the task as multiple choice improves performance by 2-6 points, and that several additional techniques can mitigate the model's extreme sensitivity to hyperparameters.
WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge
TLDR
This paper presents the first comprehensive categorization of essential commonsense knowledge for answering the Winograd Schema Challenge (WSC), and leverages the collected reasons to develop a new task called WinoWhy, which requires models to distinguish plausible reasons from very similar but wrong reasons for all WSC questions.
Are Rotten Apples Edible? Challenging Commonsense Inference Ability with Exceptions
TLDR
It is shown that language models in the BERT family experience a steep drop in performance on prompts that require reasoning about context to pick the right answer, suggesting a need for future work in developing and analyzing frameworks similar to WINOVENTI that are tuned to model-specific weaknesses.
A Review of Winograd Schema Challenge Datasets and Approaches
TLDR
This paper reviews existing Winograd Schema Challenge benchmark datasets and approaches that have been published since its introduction and suggests new approaches that should be considered.
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
TLDR
This work introduces a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs, and proposes a pairwise accuracy metric to reliably measure an agent's ability to perform commonsense reasoning over a given situation.
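The pairwise idea is simple to sketch: a model only gets credit for a complementary pair when it labels both statements correctly, so lucky single-statement guesses do not inflate the score. This is a minimal illustration of that notion, not COM2SENSE's exact evaluation code:

```python
def pairwise_accuracy(pred_pairs, gold_pairs):
    """Fraction of complementary pairs where BOTH predictions match
    the gold labels (plain accuracy would credit each statement
    independently)."""
    assert len(pred_pairs) == len(gold_pairs)
    hits = sum(1 for p, g in zip(pred_pairs, gold_pairs) if p == g)
    return hits / len(pred_pairs)

# Three (statement, complement) True/False prediction pairs.
preds = [(True, False), (True, True), (False, True)]
gold  = [(True, False), (True, False), (False, True)]
print(pairwise_accuracy(preds, gold))  # 2 of 3 pairs fully correct
```

Note that plain per-statement accuracy on the same predictions would be 5/6; the pairwise metric is deliberately stricter.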
G-DAug: Generative Data Augmentation for Commonsense Reasoning
TLDR
This work proposes a novel generative data augmentation technique, G-DAUG, which aims to achieve more accurate and robust learning in a low-resource setting and produces a diverse set of fluent training examples, demonstrating that its selection and training approaches are important for performance.
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
TLDR
Results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.
An Analysis of Dataset Overlap on Winograd-Style Tasks
TLDR
It is found that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap.
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
TLDR
A new multitask benchmark, RAINBOW, is proposed to promote research on commonsense models that generalize well over multiple tasks and datasets, along with a novel evaluation, the cost-equivalent curve, which sheds new insight into how the choice of source datasets, pretrained language models, and transfer learning methods impacts performance and data efficiency.
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human parity, …

References

Showing 1-10 of 57 references
A Surprisingly Robust Trick for the Winograd Schema Challenge
TLDR
This paper shows that the performance of three language models on WSC273 strongly improves when fine-tuned on a similar pronoun disambiguation problem dataset (denoted WSCR), and generates a large unsupervised WSC-like dataset.
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
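The iterative loop can be sketched in miniature. This toy replaces the retrained ensemble of stylistic classifiers with a fixed `is_easy` oracle (an assumption for brevity): each round, distractors the "classifier" finds easy are swapped for fresh candidates from the pool, leaving a harder negative set.

```python
import random

def adversarial_filter(candidate_pool, is_easy, k=3, n_iters=5, seed=0):
    """Toy adversarial filtering in the spirit of SWAG's AF.

    candidate_pool: generated distractor endings to draw from
    is_easy(d):     stand-in for "a stylistic classifier detects d";
                    in the real procedure this classifier is retrained
                    against the current distractor set every round.
    Returns k distractors after iterative replacement of easy ones.
    """
    rng = random.Random(seed)
    distractors = rng.sample(candidate_pool, k=k)
    for _ in range(n_iters):
        # Replace every detectable distractor with a fresh candidate.
        distractors = [rng.choice(candidate_pool) if is_easy(d) else d
                       for d in distractors]
        if not any(is_easy(d) for d in distractors):
            break
    return distractors
```

The key property is that filtering pressure comes from a model, not from handwritten rules, so the surviving negatives are hard specifically for the model family used during filtering.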
A Simple Method for Commonsense Reasoning
TLDR
Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests, which outperforms previous state-of-the-art methods by a large margin.
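The scoring trick amounts to substituting each answer candidate into the sentence and keeping the one the language model assigns the higher probability. A minimal sketch, assuming a toy unigram table in place of a trained neural LM (the `freq` table and `_` slot convention are illustrative, not from the paper):

```python
import math

def resolve_pronoun(template, candidates, logprob):
    """Substitute each candidate into the '_' slot and return the one
    whose completed sentence scores highest under the language model.
    `logprob` maps a token to its log-probability."""
    def score(sentence):
        return sum(logprob(tok) for tok in sentence.lower().split())
    return max(candidates, key=lambda c: score(template.replace("_", c)))

# Toy unigram "LM"; a real system would use a trained neural LM here.
freq = {"the": 100, "trophy": 8, "suitcase": 2,
        "did": 30, "not": 25, "fit": 10}
total = sum(freq.values())
logprob = lambda tok: math.log(freq.get(tok, 1) / total)

print(resolve_pronoun("the _ did not fit",
                      ["trophy", "suitcase"], logprob))  # prints "trophy"
```

Because the two completed sentences differ only in the candidate token, the comparison reduces to which candidate the model finds more probable in context.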
Establishing a Human Baseline for the Winograd Schema Challenge
TLDR
The results of a large online experiment are presented that both establishes a baseline for human performance on the WSC and demonstrates the importance of human testing, not only as a means of validating a particular corpus, but more fundamentally as a guide in defining desirable characteristics for Winograd.
Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge
TLDR
This paper proposes commonsense knowledge enhanced embeddings (KEE) for solving Pronoun Disambiguation Problems (PDP), and shows that the proposed KEE models solve the PDP problems with 66.7% accuracy, a new state-of-the-art performance.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
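The finding is easy to illustrate with a hypothesis-only probe that never reads the premise. This keyword heuristic is a deliberately crude stand-in for the paper's trained text classifier; it exploits the reported correlation between negation words and the contradiction label:

```python
# Toy hypothesis-only baseline: a genuine NLI model must read the
# premise, so any above-chance accuracy from the hypothesis alone
# signals annotation artifacts in the dataset.
NEGATION_CUES = {"no", "not", "never", "nobody", "nothing", "none"}

def hypothesis_only_guess(hypothesis):
    """Predict 'contradiction' when a negation cue appears in the
    hypothesis, else 'entailment'. The premise is never consulted."""
    toks = hypothesis.lower().replace(".", " ").split()
    return "contradiction" if any(t in NEGATION_CUES for t in toks) \
        else "entailment"
```

If such a probe beats chance on a balanced test set, the labels leak through the hypotheses, which is exactly the artifact the paper measures.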
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
Easy Victories and Uphill Battles in Coreference Resolution
TLDR
This work presents a state-of-the-art coreference system that captures various syntactic, discourse, and semantic phenomena implicitly, with a small number of homogeneous feature templates examining shallow properties of mentions, allowing it to win "easy victories" without crafted heuristics.
Probing Neural Network Comprehension of Natural Language Arguments
TLDR
This work analyzes the nature of spurious statistical cues in the dataset and demonstrates that a range of models all exploit them, informing the construction of an adversarial dataset on which all models achieve random accuracy.
On the Evaluation of Common-Sense Reasoning in Natural Language Understanding
TLDR
A case study of the Winograd Schema Challenge is made and a protocol is designed, based on two new measures of instance-level complexity, that both clarifies and qualifies the results of previous work.