Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy, Ellie Pavlick, Tal Linzen
A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting…
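The lexical overlap heuristic that HANS diagnoses can be illustrated with a minimal sketch (the sentences and the decision rule below are illustrative assumptions, not the paper's model): a system that predicts "entailment" whenever every hypothesis word also appears in the premise is right on many frequent MNLI-style cases but fails on subject/object swaps.

```python
# Sketch of the lexical overlap heuristic targeted by HANS: predict
# "entailment" whenever every hypothesis word occurs in the premise.

def overlap_heuristic(premise: str, hypothesis: str) -> str:
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# A frequent case where the heuristic happens to be right:
print(overlap_heuristic("The doctor near the lawyer smiled",
                        "The doctor smiled"))          # entailment (correct)

# A HANS-style subject/object swap where full overlap misleads it:
print(overlap_heuristic("The lawyer saw the doctor",
                        "The doctor saw the lawyer"))  # entailment (wrong)
```

The second call shows why HANS is hard for such shortcuts: every hypothesis word overlaps with the premise, yet the premise does not entail the hypothesis.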

Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference

This work analyzes hypothesis-only models trained on one of the recast datasets provided in Poliak et al.

Probing Natural Language Inference Models through Semantic Fragments

This work proposes the use of semantic fragments—systematically generated datasets that each target a different semantic phenomenon—for probing, and efficiently improving, such capabilities of linguistic models.

Syntactic Data Augmentation Increases Robustness to Inference Heuristics

The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, suggesting that augmentation causes BERT to recruit abstract syntactic representations.
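The subject/object inversion idea can be sketched in a few lines (this is a toy illustration over flat transitive sentences, not the paper's augmentation pipeline): swap the noun phrases before and after the verb, producing a pair whose gold label is no longer "entailment".

```python
# Toy sketch of subject/object inversion for a flat transitive sentence:
# swap the spans before and after the verb to make a non-entailed variant.

def invert_subject_object(sentence: str, verb: str) -> str:
    """Swap the word spans before and after `verb`."""
    words = sentence.split()
    i = words.index(verb)
    subject, obj = words[:i], words[i + 1:]
    return " ".join(obj + [verb] + subject)

premise = "the lawyer saw the doctor"
print(invert_subject_object(premise, "saw"))  # the doctor saw the lawyer
```

A real augmentation procedure would locate subjects and objects with a parser rather than a hand-supplied verb; the sketch only shows the shape of the transformation.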

Overcoming the Lexical Overlap Bias Using Predicate-Argument Structures

2019
The lexical overlap bias is investigated, and incorporating predicate-argument structures during fine-tuning is shown to considerably improve robustness, e.g., by about 20pp on discriminating different named entities, while incurring no additional cost at test time and requiring no changes to the model or the training procedure.

Identifying inherent disagreement in natural language inference

This paper investigates how to tease systematic inferences apart from disagreement items, and proposes Artificial Annotators (AAs) to simulate the uncertainty in the annotation process by capturing the modes in annotations.

IMPLI: Investigating NLI Models’ Performance on Figurative Language

IMPLI is introduced, an English dataset consisting of paired sentences spanning idioms and metaphors and it is shown that while NLI models can reliably detect entailment relationship between figurative phrases with their literal counterparts, they perform poorly on similarly structured examples where pairs are designed to be non-entailing.

Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

It is found that BERT learns to draw pragmatic inferences, and that NLI training encourages models to learn some, but not all, pragmatic inferences.

Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts

ConTRoL, a new passage-level NLI dataset for ConTextual Reasoning over Long texts, focuses on complex contextual reasoning types such as logical reasoning; it is derived from competitive selection and recruitment tests for police recruitment and has expert-level quality.

Analyzing machine-learned representations: A natural language case study

Representations of sentences in one such artificial system for natural language processing are studied to reveal parallels to the analogous representations in people, suggesting new ways to understand psychological phenomena in humans and informing strategies for building artificial intelligence with human-like language understanding.

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.
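The perturbation studied here is simple to sketch (the helper below is an assumption of this sketch, not the paper's data pipeline): each training sentence keeps its bag of words but loses its order.

```python
# Sketch of the pre-training perturbation: shuffle each sentence's word
# order while leaving its bag of words intact.
import random

def shuffle_words(sentence: str, seed: int = 0) -> str:
    words = sentence.split()
    random.Random(seed).shuffle(words)  # seeded for reproducibility
    return " ".join(words)

original = "the cat sat on the mat"
shuffled = shuffle_words(original)
print(shuffled)  # same words, scrambled order
assert sorted(shuffled.split()) == sorted(original.split())
```

The assertion makes the invariant explicit: distributional (co-occurrence) information survives the shuffle even though word order does not.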



Natural language inference

This dissertation explores a range of approaches to NLI, beginning with methods which are robust but approximate, and proceeding to progressively more precise approaches, and greatly extends past work in natural logic to incorporate both semantic exclusion and implicativity.

Stress Test Evaluation for Natural Language Inference

This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.
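One family of such stress tests appends label-preserving distractor material to an input; a minimal sketch (the tautology string is an assumption of this sketch) is:

```python
# Sketch of a distraction-style stress test: append a tautology to the
# premise. A model making real inferential decisions should not change
# its prediction, since the gold label is unaffected.

def add_distractor(premise: str) -> str:
    return premise + " and true is true"

print(add_distractor("A man is eating"))  # A man is eating and true is true
```

Comparing a model's predictions before and after such perturbations is what lets the methodology separate genuine inference from surface-pattern matching.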

Non-entailed subsequences as a challenge for natural language inference

Neural network models have shown great success at natural language inference (NLI), the task of determining whether a premise entails a hypothesis. However, recent studies suggest that these models…

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI examples, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
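The negation correlation can be made concrete with a hypothetical hypothesis-only rule (the cue list and example sentences are assumptions of this sketch, not the paper's classifier): a model never sees the premise, yet negation cues alone push it toward "contradiction".

```python
# Hypothetical illustration of a hypothesis-only artifact: negation cues
# in the hypothesis alone are predictive of the "contradiction" label.
NEGATION_CUES = {"no", "not", "never", "nobody", "nothing"}

def hypothesis_only_guess(hypothesis: str) -> str:
    words = set(hypothesis.lower().split())
    if words & NEGATION_CUES:
        return "contradiction"
    return "entailment"  # fall back to a majority-style guess

print(hypothesis_only_guess("Nobody is riding the bicycle"))  # contradiction
print(hypothesis_only_guess("A man is outdoors"))             # entailment
```

That such a rule beats chance without ever reading the premise is exactly the kind of annotation artifact the paper documents.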

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Do latent tree learning models identify meaningful structure in sentences?

This paper replicates two latent tree learning models in a shared codebase and finds that only one of these models outperforms conventional tree-structured models on sentence classification, and its parsing strategies are not especially consistent across random restarts.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Breaking NLI Systems with Sentences that Require Simple Lexical Inferences

A new NLI test set is created that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge, demonstrating that these systems are limited in their generalization ability.

Enhanced LSTM for Natural Language Inference

This paper presents a new state-of-the-art result, achieving the accuracy of 88.6% on the Stanford Natural Language Inference Dataset, and demonstrates that carefully designing sequential inference models based on chain LSTMs can outperform all previous models.

Evaluating Compositionality in Sentence Embeddings

This work presents a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality, and finds that augmenting the training dataset with a new dataset improves performance on a held-out test set without loss of performance on the SNLI test set.