Corpus ID: 249049412

Examining Single Sentence Label Leakage in Natural Language Inference Datasets

Michael Stephen Saxon, Xinyi Wang, Wenda Xu, William Yang Wang
Many believe human-level natural language inference (NLI) has already been achieved. In reality, modern NLI benchmarks have serious flaws, rendering progress questionable. Chief among them is the problem of single sentence label leakage, where spurious correlations and biases in datasets enable the accurate prediction of a sentence-pair relation from only a single sentence, something that should in principle be impossible. This leakage enables models to cheat rather than learn the desired…
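The failure mode the abstract describes can be illustrated with a minimal hypothesis-only classifier. This is a sketch with hypothetical toy data, not the paper's method: a naive per-word vote that never reads the premise, yet still predicts labels because cue words (e.g., negation) correlate with the contradiction class.

```python
from collections import Counter, defaultdict

# Hypothetical toy NLI pairs: (premise, hypothesis, label).
# Negation words in hypotheses correlate with "contradiction",
# mimicking the spurious cues behind single-sentence label leakage.
train = [
    ("A man plays guitar.", "A person makes music.", "entailment"),
    ("A dog runs.", "An animal is moving.", "entailment"),
    ("A woman reads.", "Nobody is reading.", "contradiction"),
    ("Kids play outside.", "The kids are not playing.", "contradiction"),
    ("A chef cooks.", "The chef is tired.", "neutral"),
    ("A girl swims.", "The girl is happy.", "neutral"),
]

# Hypothesis-only "model": per-word label counts, prediction by
# summing counts over the hypothesis tokens. The premise is ignored.
word_label = defaultdict(Counter)
for _premise, hypothesis, label in train:
    for tok in hypothesis.lower().rstrip(".").split():
        word_label[tok][label] += 1

def predict_hypothesis_only(hypothesis):
    scores = Counter()
    for tok in hypothesis.lower().rstrip(".").split():
        scores.update(word_label[tok])
    return scores.most_common(1)[0][0] if scores else "neutral"

# Cue words alone drive the prediction:
print(predict_hypothesis_only("The kids are not reading."))  # contradiction
```

Any model that beats the majority-class baseline without seeing the premise, as this one does on its cue words, is exploiting leakage rather than performing inference.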



Annotation Artifacts in Natural Language Inference Data
It is shown that a simple text categorization model can correctly classify the sentence pair from the hypothesis alone in about 67% of SNLI and 53% of MultiNLI examples, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Hypothesis Only Baselines in Natural Language Inference
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
OCNLI: Original Chinese Natural Language Inference
This paper presents the first large-scale NLI dataset for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI), which follows closely the annotation protocol used for MNLI, but creates new strategies for eliciting diverse hypotheses.
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.
Adversarial NLI: A New Benchmark for Natural Language Understanding
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.
Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets
This paper investigates the problem of selection bias on six NLSM datasets and finds that four out of them are significantly biased, and proposes a training and evaluation framework to alleviate the bias.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
CLUE: A Chinese Language Understanding Evaluation Benchmark
The first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark is introduced, an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
NeuralLog: Natural Language Inference with Joint Neural and Logical Reasoning
This work proposes an inference framework called NeuralLog, which utilizes both a monotonicity-based logical inference engine and a neural network language model for phrase alignment, and shows that the joint logic and neural inference system improves accuracy on the NLI task and can achieve state-of-the-art accuracy on the SICK and MED datasets.
What Will it Take to Fix Benchmarking in Natural Language Understanding?
It is argued that most current benchmarks fail these criteria, and that adversarially constructed, out-of-distribution test sets do not meaningfully address the causes of these failures.