Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. To determine whether models have adopted these heuristics, this work introduces a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. Models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics.

Logical Reasoning with Span-Level Predictions for Interpretable and Robust NLI Models

This work introduces a logical reasoning framework for NLI that produces highly transparent model decisions based on logical rules, and shows that the improved interpretability can be achieved without decreasing predictive accuracy.

Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference

This work analyzes hypothesis-only models trained on one of the recast datasets provided by Poliak et al.

Probing Natural Language Inference Models through Semantic Fragments

This work proposes the use of semantic fragments—systematically generated datasets that each target a different semantic phenomenon—for probing, and efficiently improving, such capabilities of linguistic models.

Syntactic Data Augmentation Increases Robustness to Inference Heuristics

The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, suggesting that augmentation causes BERT to recruit abstract syntactic representations.
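The subject/object inversion strategy described above can be sketched with a toy transformation (illustrative only, not the paper's augmentation code; the function name and sentence template are hypothetical):

```python
# Minimal sketch of subject/object inversion for a simple S-V-O sentence.
# Swapping subject and object yields a hypothesis with complete lexical
# overlap with the premise but a reversed meaning, so the correct label
# flips from entailment to non-entailment.

def invert_subject_object(subject, verb, obj):
    """Return (premise, augmented_hypothesis) for a toy S-V-O template."""
    premise = f"The {subject} {verb} the {obj}."
    hypothesis = f"The {obj} {verb} the {subject}."  # same words, new meaning
    return premise, hypothesis

premise, hypothesis = invert_subject_object("doctor", "saw", "lawyer")
# premise:    "The doctor saw the lawyer."
# hypothesis: "The lawyer saw the doctor."  -> label: non-entailment
```

A model relying purely on lexical overlap would wrongly label such a pair as entailment, which is exactly the failure mode HANS-style examples diagnose.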

Overcoming the Lexical Overlap Bias Using Predicate-Argument Structures (2019)

The lexical overlap bias is investigated, and incorporating predicate-argument structures during fine-tuning is shown to considerably improve robustness, e.g., by about 20pp on discriminating different named entities, while incurring no additional cost at test time and requiring no changes to the model or the training procedure.

IMPLI: Investigating NLI Models’ Performance on Figurative Language

IMPLI is introduced, an English dataset consisting of paired sentences spanning idioms and metaphors. It is shown that while NLI models can reliably detect the entailment relationship between figurative phrases and their literal counterparts, they perform poorly on similarly structured examples where the pairs are designed to be non-entailing.

Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

It is found that BERT learns to draw pragmatic inferences, and that NLI training encourages models to learn some, but not all, pragmatic inferences.

Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts

ConTRoL, a new dataset for ConTextual Reasoning over Long texts, is a passage-level NLI dataset with a focus on complex contextual reasoning types such as logical reasoning. It is derived from competitive selection and recruitment tests for police recruitment and offers expert-level quality.

Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence

A novel intermediate training task named meaning-matching, designed to directly learn a meaning-text correspondence, is proposed. It enables PLMs to learn lexical semantic information and is found to be a safe intermediate task that guarantees similar or better performance on downstream tasks.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
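The negation correlation noted above is the kind of artifact a premise-free classifier can exploit. A minimal illustrative sketch (the cue list and function are hypothetical, not the paper's model):

```python
# Sketch of a hypothesis-only heuristic: the classifier never sees the
# premise, yet annotation artifacts such as negation words in the
# hypothesis still correlate with the gold label.

NEGATION_CUES = {"not", "no", "nobody", "never"}

def hypothesis_only_predict(hypothesis):
    """Predict an NLI label from the hypothesis alone via a negation cue."""
    tokens = hypothesis.lower().rstrip(".").split()
    if NEGATION_CUES.intersection(tokens):
        return "contradiction"  # negation cues correlate with contradiction
    return "entailment"         # otherwise fall back to the majority class

print(hypothesis_only_predict("Nobody is outside."))  # contradiction
print(hypothesis_only_predict("A man is outside."))   # entailment
```

That such a trivial premise-free rule beats chance is precisely the evidence for annotation artifacts in these datasets.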

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Do latent tree learning models identify meaningful structure in sentences?

This paper replicates two latent tree learning models in a shared codebase and finds that only one of them outperforms conventional tree-structured models on sentence classification, and that its parsing strategies are not especially consistent across random restarts.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Breaking NLI Systems with Sentences that Require Simple Lexical Inferences

A new NLI test set is created that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge, demonstrating that these systems are limited in their generalization ability.

Enhanced LSTM for Natural Language Inference

This paper presents a new state-of-the-art result, achieving an accuracy of 88.6% on the Stanford Natural Language Inference dataset, and demonstrates that carefully designed sequential inference models based on chain LSTMs can outperform all previous models.

Evaluating Compositionality in Sentence Embeddings

This work presents a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality, and finds that augmenting the training dataset with a new dataset improves performance on a held-out test set without loss of performance on the SNLI test set.

Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness

It is shown that evaluations of NLI models can benefit from studying the influence of factors intrinsic to the models or found in the datasets used. Three factors are identified (insensitivity, polarity, and unseen pairs), and their impact on three SNLI models is examined under a variety of conditions.

Hypothesis Only Baselines in Natural Language Inference

This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.

A Fast Unified Model for Parsing and Sentence Understanding

The Stack-augmented Parser-Interpreter Neural Network (SPINN) combines parsing and interpretation within a single tree-sequence hybrid model by integrating tree-structured sentence interpretation into the linear sequential structure of a shift-reduce parser.