Annotation Artifacts in Natural Language Inference Data

@inproceedings{Gururangan2018AnnotationAI,
  title={Annotation Artifacts in Natural Language Inference Data},
  author={Suchin Gururangan and Swabha Swayamdipta and Omer Levy and Roy Schwartz and Samuel R. Bowman and Noah A. Smith},
  booktitle={NAACL},
  year={2018}
}
Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise) and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. [...] Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
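The annotation protocol described above can be illustrated with a small sketch. The sentences below are the widely cited SNLI example; the dictionary field names are hypothetical and not taken from the paper.

```python
# Illustrative sketch of the data the crowdsourcing protocol produces:
# one premise, three crowd-written hypotheses, one per label.
# (Field names are hypothetical; the sentences are the canonical SNLI example.)
premise = "A soccer game with multiple males playing."

examples = [
    {"premise": premise,
     "hypothesis": "Some men are playing a sport.",
     "label": "entailment"},
    {"premise": premise,
     "hypothesis": "The men are sleeping.",
     "label": "contradiction"},
    {"premise": premise,
     "hypothesis": "Some men are playing in a tournament.",
     "label": "neutral"},
]

# Every premise contributes exactly one hypothesis per label.
labels = sorted(ex["label"] for ex in examples)
print(labels)  # ['contradiction', 'entailment', 'neutral']
```

Because workers write all three hypotheses from the premise alone, stylistic cues can leak into the hypotheses themselves, which is the artifact the paper studies.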
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
TLDR
This work investigates two alternative protocols which automatically create candidate (premise, hypothesis) pairs for annotators to label and concludes that crowdworker writing is still the best known option for entailment data.
Uncertain Natural Language Inference
TLDR
The feasibility of collecting annotations for UNLI is demonstrated by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise.
Generating Token-Level Explanations for Natural Language Inference
TLDR
It is shown that it is possible to generate token-level explanations for NLI without the need for training data explicitly annotated for this purpose, using a simple LSTM architecture and evaluating both LIME and Anchor explanations for this task.
Stress Test Evaluation for Natural Language Inference
TLDR
This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.
Explaining Simple Natural Language Inference
TLDR
The experiment reveals several problems in the annotation guidelines, and various challenges of the NLI task itself, and leads to recommendations for future annotation tasks, for NLI and possibly for other tasks.
Improving Generalization by Incorporating Coverage in Natural Language Inference
TLDR
This work proposes to extend the input representations with an abstract view of the relation between the hypothesis and the premise, i.e., how well the individual words, or word n-grams, of the hypothesis are covered by the premise, to improve generalization.
Don’t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference
TLDR
Two probabilistic methods are proposed to build models that are more robust to hypothesis-only biases in Natural Language Inference datasets and transfer better than a baseline architecture in 9 out of 12 NLI datasets.
Mitigating Annotation Artifacts in Natural Language Inference Datasets to Improve Cross-dataset Generalization Ability
TLDR
Experimental results demonstrate that the methods considered can alleviate the negative effect of the artifacts and improve the generalization ability of models.
Learning Entailment-Based Sentence Embeddings from Natural Language Inference
TLDR
This work proposes a simple interaction layer based on predefined entailment and contradiction scores applied directly to the sentence embeddings, which achieves results on natural language inference competitive with MLP-based models and directly represents the information needed for textual entailment.
Hypothesis Only Baselines in Natural Language Inference
TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
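A hypothesis-only baseline of the kind summarized above can be sketched in a few lines. This is not the authors' code: the negation-word heuristic and the tiny cue list stand in for a learned classifier, chosen because negation words are a known contradiction cue in crowd-written NLI data.

```python
# Minimal sketch of a hypothesis-only baseline: the model predicts an NLI
# label from the hypothesis alone, never seeing the premise. A hypothetical
# negation-word heuristic stands in for a learned classifier.
CONTRADICTION_CUES = {"no", "not", "nobody", "never", "nothing"}

def hypothesis_only_predict(hypothesis: str) -> str:
    tokens = set(hypothesis.lower().replace(".", "").split())
    if tokens & CONTRADICTION_CUES:
        return "contradiction"
    return "entailment"  # fallback; a real model would score all three labels

print(hypothesis_only_predict("Nobody is playing outside."))   # contradiction
print(hypothesis_only_predict("Some men are playing a sport."))  # entailment
```

Any accuracy such a premise-blind model achieves above the majority-class baseline is evidence of annotation artifacts in the hypotheses.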

References

Showing 1–10 of 33 references
A large annotated corpus for learning natural language inference
TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Hypothesis Only Baselines in Natural Language Inference
TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
TLDR
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding, and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
Evaluating Compositionality in Sentence Embeddings
TLDR
This work presents a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality, and finds that augmenting the training dataset with a new dataset improves performance on a held-out test set without loss of performance on the SNLI test set.
Natural Language Inference over Interaction Space
TLDR
DIIN, a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space, shows that an interaction tensor (attention weight) contains semantic information to solve natural language inference.
Discovery of inference rules for question-answering
TLDR
This paper presents an unsupervised algorithm for discovering inference rules from text based on an extended version of Harris’ Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
TLDR
It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.
A SICK cure for the evaluation of compositional distributional semantic models
TLDR
This work aims to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large-size English benchmark tailored for them.
A Decomposable Attention Model for Natural Language Inference
We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable.