Probing Natural Language Inference Models through Semantic Fragments

Kyle Richardson, Hai Hu, Lawrence S. Moss, Ashish Sabharwal
Do state-of-the-art models for language understanding already have, or can they easily learn, abilities such as boolean coordination, quantification, conditionals, comparatives, and monotonicity reasoning (i.e., reasoning about word substitutions in sentential contexts)? While such phenomena are involved in natural language inference (NLI) and go beyond basic linguistic understanding, it is unclear the extent to which they are captured in existing NLI benchmarks and effectively learned by… 


Logical Inferences with Comparatives and Generalized Quantifiers

This paper presents a compositional semantics that maps various comparative constructions in English to semantic representations via Combinatory Categorial Grammar parsers and combines it with an inference system based on automated theorem proving that outperforms previous logic-based systems as well as recent deep learning-based models.

SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics

This work proposes a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose challenge is to map natural language sentences to multiple forms of scoped meaning representations, designed to account for various semantic phenomena.

Towards Coinductive Models for Natural Language Understanding. Bringing together Deep Learning and Deep Semantics

It is argued that the known individual limitations of induction and coinduction can be overcome in empirical settings by a combination of the two methods.

Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

It is found that models trained on general-purpose NLI datasets fail systematically on MoNLI examples containing negation, but that MoNLI fine-tuning addresses this failure, suggesting that the BERT model at least partially embeds a theory of lexical entailment and negation at an algorithmic level.

Exploring Transitivity in Neural NLI Models through Veridicality

It is found that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples.

Probing Linguistic Systematicity

Evidence is provided that current state-of-the-art NLU systems do not generalize systematically, despite overall high performance.

Polish Natural Language Inference and Factivity - an Expert-based Dataset and Benchmarks

A new dataset that focuses exclusively on the factivity phenomenon is contributed, and BERT-based models consuming only the input sentences are shown to capture most of the complexity of NLI/factivity.

Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

It is found that BERT learns to draw pragmatic inferences, and NLI training encourages models to learn some, but not all, pragmatic inferences.

Supporting Context Monotonicity Abstractions in Neural NLI Models

This work reframes the problem of context monotonicity classification to make it compatible with transformer-based pre-trained NLI models, adds this task to the training pipeline, and introduces a sound and complete simplified monotonicity logic formalism which describes the treatment of contexts as abstract units.

A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs

A benchmark collection of NLI examples that are grammatical and correctly labeled, as a result of manual inspection and reformulation, is presented to probe the negation-awareness of multilingual language models; it is found that models that correctly predict examples with negation cues often fail to correctly predict their counter-examples without negation cues.



Using syntactical and logical forms to evaluate textual inference competence

This work evaluates two kinds of neural models that implicitly exploit language structure, recurrent models and the Transformer network BERT, and shows that although BERT is clearly more efficient at generalizing over most logical forms, there is room for improvement when dealing with counting operators.

A logical-based corpus for cross-lingual evaluation

This work evaluates two kinds of deep learning models that implicitly exploit language structure, recurrent models and the Transformer network BERT, and shows that although BERT is clearly more efficient at generalizing over most logical forms, there is room for improvement when dealing with counting operators.

Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs

It is concluded that a variety of methods is necessary to reveal all relevant aspects of a model’s grammatical knowledge in a given domain.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.

Enhanced LSTM for Natural Language Inference

This paper presents a new state-of-the-art result, achieving an accuracy of 88.6% on the Stanford Natural Language Inference dataset, and demonstrates that carefully designed sequential inference models based on chain LSTMs can outperform all previous models.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where the heuristics fail, can motivate and measure progress in this area.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity

It is shown that MonaLog can be used in combination with the current state-of-the-art model BERT in a variety of settings, including compositional data augmentation, and that it is capable of generating large amounts of high-quality training data for BERT, improving its accuracy on SICK.

Stress Test Evaluation for Natural Language Inference

This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.