SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

@inproceedings{Zellers2018SWAGAL,
  title={SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference},
  author={Rowan Zellers and Yonatan Bisk and Roy Schwartz and Yejin Choi},
  booktitle={EMNLP},
  year={2018}
}
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). To address the recurring challenges of annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers and using them to filter the data.
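As a rough illustration of the AF idea only (not the paper's actual pipeline, which operates on LM-generated endings with learned ensembles of stylistic classifiers), here is a minimal numpy sketch: a linear classifier is repeatedly retrained to tell real endings from the current negatives, and the negatives it finds easy are swapped for pool candidates it mistakes for real. The function names and the feature-vector representation are hypothetical.

```python
import numpy as np

def train_classifier(X, y, lr=0.5, steps=200):
    """Logistic regression by gradient descent: real (1) vs fake (0) endings."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def adversarial_filter(real, negatives, pool, rounds=5, k=10):
    """Simplified AF loop: in each round, train a classifier on real vs the
    current negatives, then replace the k negatives it finds easiest (lowest
    "looks real" score) with the k pool candidates it scores as most
    real-seeming. Pool reuse across rounds is not handled in this sketch."""
    for _ in range(rounds):
        X = np.vstack([real, negatives])
        y = np.concatenate([np.ones(len(real)), np.zeros(len(negatives))])
        w = train_classifier(X, y)
        easy = np.argsort(negatives @ w)[:k]      # easiest current negatives
        hard = np.argsort(pool @ w)[-k:]          # hardest pool candidates
        negatives[easy] = pool[hard]
    return negatives
```

After filtering, a freshly trained classifier should separate real endings from the negatives less accurately than before, which is the intended effect: stylistic shortcuts no longer suffice.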


HellaSwag: Can a Machine Really Finish Your Sentence?
TLDR
The construction of HellaSwag, a new challenge dataset, and its resulting difficulty, sheds light on the inner workings of deep pretrained models, and suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
TLDR
This work introduces a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs, and proposes a pairwise accuracy metric to reliably measure an agent's ability to perform commonsense reasoning over a given situation.
RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms
TLDR
A new challenge, RICA: Robust Inference using Commonsense Axioms, that evaluates robust commonsense inference despite textual perturbations and shows that PTLMs perform no better than random guessing on the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks.
G-DAug: Generative Data Augmentation for Commonsense Reasoning
TLDR
This work proposes a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting and produces a diverse set of fluent training examples, demonstrating that its selection and training approaches are important for performance.
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
TLDR
This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.
Generative Data Augmentation for Commonsense Reasoning
TLDR
This work investigates G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting, and demonstrates that it produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
TLDR
This work proposes gamification as a framework for data construction and creates CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrates its difficulty for models that are orders of magnitude larger than the AI used in the game itself.
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
TLDR
This work investigates this annotation methodology and applies it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop, finding that stronger models can still learn from datasets collected with substantially weaker models in the loop.
HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference
TLDR
This work extracts various phrases from the hypotheses (artificial patterns) in the training sets, shows that they are strong indicators of specific labels, and investigates two debiasing approaches that exploit the artificial-pattern modeling to mitigate such hypothesis-only bias.
Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases
TLDR
This paper trains a naive model that makes predictions exclusively based on dataset biases, then trains a robust model in an ensemble with the naive one, encouraging the robust model to focus on other patterns in the data that are more likely to generalize.
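The ensemble idea above is often realized as a product of experts: the robust model is trained through the combined distribution softmax(robust_logits + bias_logprobs) with the bias model frozen, so its gradient shrinks on examples the bias model already answers correctly. A minimal numpy sketch (the function name and interface are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def poe_loss_and_grad(robust_logits, bias_logprobs, labels):
    """Cross-entropy on the product-of-experts distribution.

    The bias model is frozen; only the robust model receives gradients.
    Where the bias model is already confident and correct, the combined
    distribution is near the label, so the robust model's gradient vanishes
    and it is pushed to explain the remaining, harder examples instead."""
    combined = softmax(robust_logits + bias_logprobs)
    n = len(labels)
    loss = -np.log(combined[np.arange(n), labels]).mean()
    onehot = np.eye(combined.shape[1])[labels]
    grad = (combined - onehot) / n  # d(loss)/d(robust_logits)
    return loss, grad
```

With a confidently correct bias model the gradient on the robust model's logits is near zero, while an uninformative (uniform) bias model leaves the usual cross-entropy gradient intact.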

References

SHOWING 1-10 OF 71 REFERENCES
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Mitigating Unwanted Biases with Adversarial Learning
TLDR
This work presents a framework for mitigating biases concerning demographic groups by including a variable for the group of interest and simultaneously learning a predictor and an adversary, which results in accurate predictions that exhibit less evidence of stereotyping the protected variable.
Tackling the Story Ending Biases in The Story Cloze Test
TLDR
A new crowdsourcing scheme is designed that creates a new SCT dataset that overcomes some of the biases; a few models are benchmarked on the new dataset, showing that the top-performing model on the original SCT dataset fails to keep up its performance.
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
TLDR
A new framework for evaluating story understanding and script learning: the 'Story Cloze Test', which requires a system to choose the correct ending to a four-sentence story, and a new corpus of ~50k five-sentence commonsense stories, ROCStories, to enable this evaluation.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
TLDR
This work proposes to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for collective inference to reduce the magnitude of bias amplification in multilabel object classification and visual semantic role labeling.
Ordinal Common-sense Inference
TLDR
This work describes a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task, and annotates subsets of previously established datasets via the ordinal annotation protocol in order to analyze the distinctions between these and what is constructed.
A SICK cure for the evaluation of compositional distributional semantic models
TLDR
This work aims to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large-size English benchmark tailored for them.
A large annotated corpus for learning natural language inference
TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task
TLDR
A model is developed that uses hierarchical recurrent networks with attention to encode the sentences in the story and score candidate endings and finds several types of clues that lead to this high accuracy, including those related to sentiment, negation, and general ending likelihood regardless of the story context.