HellaSwag: Can a Machine Really Finish Your Sentence?

@inproceedings{Zellers2019HellaSwagCA,
  title={HellaSwag: Can a Machine Really Finish Your Sentence?},
  author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
  booktitle={ACL},
  year={2019}
}
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up: "She sets her fingers on the keys." [...]
Key Result
More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
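To make the task concrete, the sketch below scores each candidate ending under a pretrained causal language model and picks the most likely one. This is only one plausible zero-shot setup, not the paper's evaluation protocol; the model choice (gpt2 via the Hugging Face transformers library) and the second example ending are illustrative assumptions.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ending_log_prob(context: str, ending: str) -> float:
    """Sum of log-probabilities of the ending tokens, given the context."""
    ctx_ids = tokenizer.encode(context)
    end_ids = tokenizer.encode(" " + ending)
    input_ids = torch.tensor([ctx_ids + end_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logits at position t predict token t+1, hence the one-step shift.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = range(len(ctx_ids) - 1, len(ctx_ids) + len(end_ids) - 1)
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(positions, end_ids))

context = "A woman sits at a piano."
endings = ["She sets her fingers on the keys.", "She boards the train home."]
print(max(endings, key=lambda e: ending_log_prob(context, e)))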
Citations

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
TLDR
This work introduces a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs, and proposes a pairwise accuracy metric to reliably measure an agent’s ability to perform commonsense reasoning over a given situation.
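A minimal sketch of that pairwise accuracy metric, assuming the data arrives as complementary (statement, label) pairs; a model is credited only when it judges both members of a pair correctly, so guessing one side cannot inflate the score.

def pairwise_accuracy(pairs, predict):
    """pairs: list of ((statement_a, label_a), (statement_b, label_b)),
    where the two statements are complementary counterparts.
    predict: callable mapping a statement to a True/False prediction."""
    correct = sum(predict(a) == la and predict(b) == lb
                  for (a, la), (b, lb) in pairs)
    return correct / len(pairs)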
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning
TLDR
This paper introduces a new scoring method that casts a plausibility ranking task in a full-text format, leverages the masked language modeling head tuned during the pre-training phase, and requires less annotated data than the standard classifier approach to reach equivalent performance.
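One way to realize such masked-LM scoring is a pseudo-log-likelihood: mask each token in turn, sum the log-probability of the true token, and rank full-text candidates by the total. The sketch below assumes bert-base-uncased and the Hugging Face transformers API; the paper's exact scoring may differ.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn; sum the log-probability of the true token."""
    ids = tokenizer.encode(sentence, return_tensors="pt")
    total = 0.0
    for i in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        total += torch.log_softmax(logits[0, i], dim=-1)[ids[0, i]].item()
    return total

# Rank each candidate written out in full text; the highest score wins.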
G-DAug: Generative Data Augmentation for Commonsense Reasoning
TLDR
This work proposes a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting; it produces a diverse set of fluent training examples, demonstrating that its selection and training approaches are important for performance.
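A highly simplified generate-then-select loop in this spirit (not G-DAUG^C's actual pipeline): sample synthetic examples from a generator, then keep those that a task model labels confidently and that add lexical diversity. Every interface here (generator.sample, task_model.confidence, ex.text) is an assumption for illustration.

def augment(generator, task_model, n_candidates, n_keep):
    """Return up to n_keep synthetic training examples."""
    candidates = [generator.sample() for _ in range(n_candidates)]
    # Prefer examples the task model labels confidently (a quality proxy)...
    candidates.sort(key=task_model.confidence, reverse=True)
    kept, seen_tokens = [], set()
    for ex in candidates:
        tokens = set(ex.text.split())
        # ...then greedily favor examples that overlap little with the kept set.
        if len(tokens & seen_tokens) < 0.8 * len(tokens):
            kept.append(ex)
            seen_tokens |= tokens
        if len(kept) == n_keep:
            break
    return kept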
On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study
TLDR
Across a variety of models and datasets, it is found that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets.
CommonGen: A Constrained Text Generation Dataset Towards Generative Commonsense Reasoning
TLDR
This work presents CommonGen: a challenging dataset for testing generative commonsense reasoning with a constrained text generation task, and provides high-quality rationales behind the reasoning process for the development and test sets from the human annotators.
PIQA: Reasoning about Physical Commonsense in Natural Language
TLDR
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and analysis of the dimensions of knowledge that existing models lack is provided, offering significant opportunities for future research.
Generating Adversarial Examples for Topic-Dependent Argument Classification
TLDR
The aim of the current work is to improve the robustness of argument classification models using adversarial training, and to demonstrate the robustness of BERT for the argument classification task while highlighting that it is not invulnerable to simple linguistic perturbations in the input data.
Adversarial NLI: A New Benchmark for Natural Language Understanding
TLDR
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the weaknesses of state-of-the-art models.
Improving Paraphrase Detection with the Adversarial Paraphrasing Task
TLDR
A new adversarial method of dataset creation for paraphrase identification is introduced: the Adversarial Paraphrasing Task (APT), which asks participants to generate semantically equivalent but lexically and syntactically disparate paraphrases.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.

References

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
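A compact sketch of the Adversarial Filtering loop: repeatedly train a discriminator on real versus machine-written endings and swap out the negatives it finds easy for harder candidates from each example's pool. The train/score interfaces, the single classifier (the paper uses an ensemble and held-out splits), and the 0.9 threshold are all illustrative assumptions.

import random

def adversarial_filter(examples, pools, train, score, rounds=10):
    """examples: (context, gold_ending) pairs.
    pools: per-example pools of machine-written candidate endings.
    train: fits a discriminator on real vs. machine-written endings.
    score: discriminator confidence that an ending is machine-written."""
    negatives = [random.choice(pool) for pool in pools]
    for _ in range(rounds):
        clf = train(examples, negatives)
        for i, (context, _) in enumerate(examples):
            # Swap a negative the discriminator detects easily for the pool
            # candidate it finds hardest to tell apart from human writing.
            if score(clf, context, negatives[i]) > 0.9:
                negatives[i] = min(pools[i],
                                   key=lambda e: score(clf, context, e))
    return negatives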
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Adversarial Examples for Evaluating Reading Comprehension Systems
TLDR
This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
The Curious Case of Neural Text Degeneration
TLDR
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
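Nucleus (top-p) sampling is short to write down: at each step, keep only the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalize, and sample from that set. A minimal PyTorch version follows; the p=0.9 default is an illustrative choice.

import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """logits: 1-D tensor of next-token logits. Returns one sampled token id."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Smallest prefix of the sorted vocabulary with cumulative mass >= p.
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice])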
Enhanced LSTM for Natural Language Inference
TLDR
A new state-of-the-art result is presented, achieving an accuracy of 88.6% on the Stanford Natural Language Inference dataset, and it is demonstrated that carefully designed sequential inference models based on chain LSTMs can outperform all previous models.
Hypothesis Only Baselines in Natural Language Inference
TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
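A minimal sketch of such a hypothesis-only baseline, assuming a bag-of-words classifier from scikit-learn (the paper's models differ): fit on hypothesis text alone and compare test accuracy to the majority class.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(train_hyps, train_labels, test_hyps, test_labels):
    """Fit a classifier on hypothesis text alone; premises are never seen."""
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_hyps, train_labels)
    # Accuracy well above the majority-class baseline signals annotation artifacts.
    return clf.score(test_hyps, test_labels)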
Synthetic and Natural Noise Both Break Neural Machine Translation
TLDR
It is found that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise, including structure-invariant word representations and robust training on noisy texts.