Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, Benjamin Van Durme
We present a large-scale collection of diverse natural language inference (NLI) datasets that provides insight into how well a sentence representation captures distinct types of reasoning. The collection was created by recasting 13 existing datasets covering 7 semantic phenomena into a common NLI structure, yielding over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at…


Transforming Question Answering Datasets Into Natural Language Inference Datasets

This work proposes a new method for automatically deriving NLI datasets from the growing abundance of large-scale question answering datasets, and relies on learning a sentence transformation model which converts question-answer pairs into their declarative forms.

Uncertain Natural Language Inference

The feasibility of collecting annotations for UNLI is demonstrated by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise.

Learning Entailment-Based Sentence Embeddings from Natural Language Inference

This work proposes a simple interaction layer based on predefined entailment and contradiction scores applied directly to the sentence embeddings, which achieves results on natural language inference competitive with MLP-based models and directly represents the information needed for textual entailment.

Temporal Reasoning in Natural Language Inference

Five new natural language inference (NLI) datasets focused on temporal reasoning are introduced and four existing datasets annotated for event duration and event ordering are recast into more than one million NLI examples.

Discourse-Based Evaluation of Language Understanding

We introduce DiscEval, a compilation of 11 evaluation datasets with a focus on discourse, that can be used for evaluation of English Natural Language Understanding when considering meaning as use.

Ultra-fine Entity Typing with Indirect Supervision from Natural Language Inference

LITE is presented, a new approach that formulates entity typing as a natural language inference (NLI) problem, making use of the indirect supervision from NLI to infer type information meaningfully represented as textual hypotheses and alleviate the data scarcity issue.

A Pragmatics-Centered Evaluation Framework for Natural Language Understanding

It is shown that natural language inference, a widely used pretraining task, does not result in genuinely universal representations, which presents a new challenge for multi-task learning.

Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches

An overview of recent benchmarks, relevant knowledge resources, and state-of-the-art learning and inference approaches in order to support a better understanding of this growing field of NLP is provided.



Figurative Language in Recognizing Textual Entailment

A collection of recognizing textual entailment (RTE) datasets focused on figurative language is introduced, indicating that state-of-the-art models trained on popular RTE datasets may not sufficiently capture figurative language.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference

A process is proposed for investigating the extent to which sentence representations arising from neural machine translation (NMT) systems encode distinct semantic phenomena; the NMT encoder appears best suited to supporting inferences at the syntax-semantics interface.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.

Hypothesis Only Baselines in Natural Language Inference

This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

It is shown that universal sentence representations trained using the supervised data of the Stanford Natural Language Inference dataset can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.

Stress Test Evaluation for Natural Language Inference

This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.

Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework

A general strategy to automatically generate one or more sentential hypotheses based on an input sentence and pre-existing manual semantic annotations is presented, which enables us to probe a statistical RTE model’s performance on different aspects of semantics.

Generating Entailment Rules from FrameNet

An algorithm is presented that generates inference rules between predicates from FrameNet and shows that the novel resource is effective and complements WordNet in terms of rule coverage.

Evaluating Compositionality in Sentence Embeddings

This work presents a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality, and finds that augmenting the training dataset with a new dataset improves performance on a held-out test set without loss of performance on the SNLI test set.