HellaSwag: Can a Machine Really Finish Your Sentence?

@inproceedings{Zellers2019HellaSwagCA,
  title={HellaSwag: Can a Machine Really Finish Your Sentence?},
  author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
  booktitle={ACL},
  year={2019}
}
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." […]

Key Result: More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the state of the art in an adversarial way, so as to present ever-harder challenges.

CommonsenseQA 2.0: Exposing the Limits of AI through Gamification

TLDR
This work proposes gamification as a framework for data construction and creates CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrates its difficulty for models that are orders of magnitude larger than the AI used in the game itself.

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

TLDR
This work introduces a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs, and proposes a pairwise accuracy metric to reliably measure an agent's ability to perform commonsense reasoning over a given situation.
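
The pairwise accuracy metric described above can be sketched as follows: a model is credited for a complementary pair only when it labels both statements of the pair correctly, which penalizes models that succeed on one sentence by exploiting surface cues. This is an illustrative sketch, not the paper's reference implementation; the input format is an assumption.

```python
def pairwise_accuracy(pairs):
    """Fraction of complementary pairs answered fully correctly.

    pairs: list of ((pred_a, gold_a), (pred_b, gold_b)), one tuple of
    (prediction, gold label) per statement in each complementary pair.
    A pair counts as correct only if BOTH statements are correct.
    """
    correct = sum(
        1 for (pa, ga), (pb, gb) in pairs if pa == ga and pb == gb
    )
    return correct / len(pairs)
```

Under this metric a model guessing each statement independently at 50% accuracy scores only 25% on pairs, which is why it is a more reliable measure than per-statement accuracy.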

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

TLDR
This work investigates this annotation methodology and applies it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop, finding that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.

G-DAug: Generative Data Augmentation for Commonsense Reasoning

TLDR
This work proposes a novel generative data augmentation technique, G-DAUGˆC, that aims to achieve more accurate and robust learning in a low-resource setting and produces a diverse set of fluent training examples, demonstrating that its selection and training approaches are important for performance.

Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning

TLDR
This paper introduces a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase; it requires less annotated data than the standard classifier approach to reach equivalent performance.
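
The full-text scoring idea above can be sketched as a pseudo-log-likelihood: mask each token of a candidate sentence in turn, sum the masked LM head's log-probability of the true token at each position, and rank candidates by that score. The `masked_token_logprob` callable below is a hypothetical stand-in for a real masked LM; this is a sketch of the general technique, not the paper's exact method.

```python
def pseudo_log_likelihood(tokens, masked_token_logprob):
    """Score a candidate by masking each position in turn and summing
    the log-probability the masked LM assigns to the true token there.
    `masked_token_logprob(tokens, i)` is a placeholder for an MLM head."""
    return sum(masked_token_logprob(tokens, i) for i in range(len(tokens)))

def rank_candidates(candidates, masked_token_logprob):
    """Return the candidate sentence the masked LM finds most plausible."""
    return max(
        candidates,
        key=lambda c: pseudo_log_likelihood(c, masked_token_logprob),
    )
```

Because scoring reuses the pre-trained MLM head directly, no new classification layer has to be trained, which is what lets this approach work with little or no annotated data.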

On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study

TLDR
Across a variety of models and datasets, it is found that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets.

CommonGen: A Constrained Text Generation Dataset Towards Generative Commonsense Reasoning

TLDR
This work presents CommonGen: a challenging dataset for testing generative commonsense reasoning with a constrained text generation task, and provides high-quality rationales behind the reasoning process for the development and test sets from the human annotators.

Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

TLDR
This work studies the impact of applying three common approaches for adversarial dataset creation: filtering out easy examples, perturbing examples, and model-in-the-loop data collection (ANLI and AdversarialQA), across 18 different adversary models.

Generative Data Augmentation for Commonsense Reasoning

TLDR
This work investigates G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting, and demonstrates that it produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.

PIQA: Reasoning about Physical Commonsense in Natural Language

TLDR
The task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA are introduced and analysis about the dimensions of knowledge that existing models lack are provided, which offers significant opportunities for future research.
...

References

Showing 1-10 of 22 references

Annotation Artifacts in Natural Language Inference Data

TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Language Models are Unsupervised Multitask Learners

TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Adversarial Examples for Evaluating Reading Comprehension Systems

TLDR
This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.

The Curious Case of Neural Text Degeneration

TLDR
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text more closely matches the quality of human text, yielding enhanced diversity without sacrificing fluency or coherence.
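
The nucleus (top-p) sampling procedure summarized above can be sketched in a few lines: at each step, keep the smallest set of highest-probability tokens whose cumulative mass exceeds p, renormalize, and sample from that set. A minimal sketch over a single next-token distribution, assuming NumPy; parameter names are illustrative.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token index from the smallest high-probability set
    (the 'nucleus') whose cumulative mass exceeds p, renormalized."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass > p
    nucleus = sorted_probs[:cutoff] / cumulative[cutoff - 1]  # renormalize
    return int(order[rng.choice(cutoff, p=nucleus)])
```

With a small p the nucleus collapses to greedy decoding (only the top token survives), while p near 1 approaches full sampling; the cutoff adapts to how peaked the distribution is, which is the "dynamic" part.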

Enhanced LSTM for Natural Language Inference

TLDR
This paper presents a new state-of-the-art result, achieving the accuracy of 88.6% on the Stanford Natural Language Inference Dataset, and demonstrates that carefully designing sequential inference models based on chain LSTMs can outperform all previous models.

Hypothesis Only Baselines in Natural Language Inference

TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.

Synthetic and Natural Noise Both Break Neural Machine Translation

TLDR
It is found that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise, including structure-invariant word representations and robust training on noisy texts.

Dense-Captioning Events in Videos

TLDR
This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.

Deep Contextualized Word Representations

TLDR
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.