TaxiNLI: Taking a Ride up the NLU Hill

@inproceedings{Joshi2020TaxiNLITA,
  title={TaxiNLI: Taking a Ride up the NLU Hill},
  author={Pratik M. Joshi and Somak Aditya and Aalok Sathe and Monojit Choudhury},
  booktitle={CoNLL},
  year={2020}
}
Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance on the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear which specific concepts are learnt by the trained systems and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce… 
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI
TLDR
This work proposes an extensible framework to collectively yet categorically test diverse LOgical reasoning capabilities required for NLI (and by extension, NLU) and creates a semi-synthetic large test-bench that offers the following utilities: individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning).
Can Transformer Language Models Predict Psychometric Properties?
TLDR
Cases are found in which transformer-based LMs predict psychometric properties consistently well in certain categories but consistently poorly in others, thus providing new insights into fundamental similarities and differences between human and LM reasoning.
Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks
TLDR
It is found that quantifiers are pervasive in NLU benchmarks, and their occurrence at test time is associated with performance drops, and Generalized Quantifier Theory is relied on for language-independent representations of the semantics of quantifier words.
QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning
TLDR
This paper introduces a minimally biased, diagnostic visual question-answering dataset, QLEVR, that goes beyond existential and numerical quantification and focuses on more complex quantifiers and their combinations, e.g., asking whether there are more than two red balls that are smaller than at least three blue balls in an image.
A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding
TLDR
This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language.
Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding
TLDR
Curriculum is introduced as a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena and it is shown that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
IndoNLI: A Natural Language Inference Dataset for Indonesian
TLDR
IndoNLI is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning.
Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance
TLDR
A category-annotated multilingual NLI dataset is proposed and the challenges to scale monolingual annotations to multiple languages are discussed, and interesting effects that the confluence of reasoning types and language similarities have on transfer performance are observed.
Structured Prediction in NLP - A survey
TLDR
A brief overview of major techniques in structured prediction and its applications in NLP domains such as parsing, sequence labeling, text generation, and sequence-to-sequence tasks is provided.
...

References

SHOWING 1-10 OF 35 REFERENCES
Adversarial NLI: A New Benchmark for Natural Language Understanding
TLDR
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding their weaknesses.
Stress Test Evaluation for Natural Language Inference
TLDR
This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
TLDR
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
Enhanced LSTM for Natural Language Inference
TLDR
This paper presents a new state-of-the-art result, achieving the accuracy of 88.6% on the Stanford Natural Language Inference Dataset, and demonstrates that carefully designing sequential inference models based on chain LSTMs can outperform all previous models.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and that, with improved pretraining, it can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
The Role of Logic and Ontology in Language and Reasoning
  • J. Sowa
  • Philosophy, Computer Science
  • 2010
TLDR
These issues are analyzed in terms of Peirce's semiotics and Wittgenstein's language games, leading to a more dynamic, flexible, and extensible basis for ontology and its use in formal and informal reasoning.
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors.
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
TLDR
This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, and advocates for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition
TLDR
It is found that BERT learns to draw pragmatic inferences, and NLI training encourages models to learn some, but not all, pragmatic inferences.
...