Embracing Ambiguity: Shifting the Training Target of NLI Models

@article{Meissner2021EmbracingAS,
  title={Embracing Ambiguity: Shifting the Training Target of NLI Models},
  author={Johannes Mario Meissner and Napat Thumwanit and Saku Sugawara and Akiko Aizawa},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.03020}
}
Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels. While much research pays little attention to this fact, several recent efforts, such as UNLI and ChaosNLI, acknowledge and embrace the existence of ambiguity. In this paper, we explore the option of training directly on the estimated label distribution of the annotators in the NLI task, using a learning loss based on this ambiguity distribution instead of the gold labels. We…
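Concretely, "training on the estimated label distribution" amounts to replacing the one-hot target with soft labels over the three NLI classes. Below is a minimal PyTorch sketch of such a loss, assuming an annotator distribution is available per example; the tensors are illustrative stand-ins, not the paper's data pipeline:

```python
import torch
import torch.nn.functional as F

# Soft-label NLI training: each example carries the estimated distribution
# of annotator votes over (entailment, neutral, contradiction) instead of
# a single one-hot gold label.
logits = torch.randn(4, 3, requires_grad=True)      # stand-in model outputs, batch of 4
annotator_dist = torch.tensor([[0.60, 0.30, 0.10],
                               [0.10, 0.70, 0.20],
                               [0.90, 0.05, 0.05],
                               [0.34, 0.33, 0.33]])  # soft targets; rows sum to 1

# F.kl_div(input, target) computes KL(target || input) for log-prob input:
# it penalizes deviation from the full annotator distribution rather than
# from a single gold label.
log_probs = F.log_softmax(logits, dim=-1)
loss = F.kl_div(log_probs, annotator_dist, reduction="batchmean")
loss.backward()
print(loss.item())
```

Up to a constant entropy term of the target, this KL objective is equivalent to cross-entropy against the soft labels.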

Citations

Capture Human Disagreement Distributions by Calibrated Networks for Natural Language Inference

TLDR
The overhead of collecting gold ambiguity labels can be cut by instead calibrating the NLI network: the model naturally captures the human ambiguity distribution as long as it is calibrated, i.e., its predictive probabilities reflect the true correctness likelihood.
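The calibration referred to here can be achieved with standard post-hoc methods such as temperature scaling; a minimal sketch under that assumption, where the cached validation logits and labels are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

# Temperature scaling: learn a single scalar T > 0 on a validation set so
# that softmax(logits / T) is better calibrated. The argmax prediction is
# unchanged; only the confidence (and hence the implied label distribution)
# shifts.
val_logits = torch.randn(256, 3)            # stand-in for cached validation logits
val_labels = torch.randint(0, 3, (256,))    # stand-in for validation gold labels

log_T = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
print("learned temperature:", log_T.exp().item())
```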

Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning

We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation…
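One plausible instance of the "additional distribution estimation" mentioned above is Monte Carlo dropout; a minimal sketch, where the small classifier stands in for a full NLI model:

```python
import torch
import torch.nn as nn

# Monte Carlo dropout: keep dropout active at inference time and average
# the softmax outputs of several stochastic forward passes to estimate a
# distribution over human judgements.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                      nn.Dropout(p=0.1), nn.Linear(256, 3))
model.train()                                 # .train() keeps dropout stochastic

features = torch.randn(1, 768)                # stand-in sentence-pair encoding
with torch.no_grad():
    samples = torch.stack([model(features).softmax(-1) for _ in range(30)])
predicted_dist = samples.mean(dim=0)          # estimate of the opinion distribution
print(predicted_dist)
```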

Mitigating Dataset Artifacts in Natural Language Inference Through Automatic Contextual Data Augmentation and Learning Optimization

TLDR
This paper presents a novel data augmentation technique combined with a unique learning procedure for the task, and shows that ACDA-boosted pre-trained language models employing the combined approach consistently outperform the corresponding fine-tuned baseline pre-trained language models across both benchmark datasets and adversarial examples.

Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

TLDR
This study frames the task as asking multiple questions that share the same set of possible endings as candidate answers, given a short story text, and finds that even current strong pretrained language models struggle to answer the questions consistently.

An Understanding-Oriented Robust Machine Reading Comprehension Model

TLDR
This paper proposes an understanding-oriented machine reading comprehension model, integrated with a multi-task learning based method, to address three kinds of robustness issues: oversensitivity, overstability, and generalization.

Symptom Identification for Interpretable Detection of Multiple Mental Disorders

Mental disease detection (MDD) from social media has suffered from poor generalizability and interpretability, due to a lack of symptom modeling. This paper introduces PsySym, the first annotated…

References

What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

TLDR
This work collects ChaosNLI, a dataset with a total of 464,500 annotations, to study Collective HumAn OpinionS in oft-used NLI evaluation sets; it argues for a detailed examination of human agreement in future data collection efforts and for evaluating model outputs against the distribution over collective human opinions.
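Evaluating against the distribution over collective opinions is typically done with a divergence measure; a minimal sketch using Jensen-Shannon distance, with illustrative vote counts:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Empirical human distribution from (say) 100 annotator votes vs. the
# model's predicted softmax over (entailment, neutral, contradiction).
human = np.array([0.62, 0.30, 0.08])   # e.g. 62/30/8 votes for E/N/C
model = np.array([0.80, 0.15, 0.05])

# scipy returns the JS *distance* (square root of the divergence); with
# base=2 the score falls in [0, 1], and lower means the model better
# matches collective human opinion.
print("JSD:", jensenshannon(human, model, base=2))
```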

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

TLDR
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Uncertain Natural Language Inference

TLDR
The feasibility of collecting annotations for UNLI is demonstrated by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise.
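Training against UNLI-style scalar probability labels reduces to a regression objective; a minimal sketch, assuming sentence-pair encodings are already computed (the two-layer regressor and MSE loss are illustrative choices, not necessarily the paper's exact setup):

```python
import torch
import torch.nn as nn

# UNLI-style target: instead of {entailment, neutral, contradiction}, each
# premise-hypothesis pair carries a scalar probability that the hypothesis
# is true given the premise.
regressor = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

encodings = torch.randn(8, 768)   # stand-in pair encodings
targets = torch.rand(8)           # human probability judgements in [0, 1]

preds = torch.sigmoid(regressor(encodings)).squeeze(-1)  # keep outputs in [0, 1]
loss = nn.functional.mse_loss(preds, targets)
loss.backward()
print(loss.item())
```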

A large annotated corpus for learning natural language inference

TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
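A minimal sketch of the "one additional output layer" fine-tuning recipe for three-way NLI, using the Hugging Face transformers API (the example pair and label are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# BERT is fine-tuned for NLI by adding a single classification head on top
# of the pre-trained encoder; everything else is reused as-is.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

batch = tokenizer(["A man is sleeping."], ["A person rests."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0])               # e.g. 0 = entailment

outputs = model(**batch, labels=labels)  # loss and logits in one call
outputs.loss.backward()                  # gradients for one fine-tuning step
print(outputs.logits)
```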

Inherent Disagreements in Human Textual Inferences

TLDR
It is argued for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments to reflect the type of uncertainty present in human disagreements.

Building an Evaluation Scale using Item Response Theory

TLDR
Item Response Theory from psychometrics is shown to describe characteristics of individual items - their difficulty and discriminating power - and to account for these characteristics when estimating human intelligence or ability for an NLP task.
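Difficulty and discriminating power are exactly the two parameters of the two-parameter logistic (2PL) IRT model, in which the probability of a correct response is a logistic function of subject ability; a small numeric sketch:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability a subject with ability theta answers an item
    with difficulty b and discrimination a correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.array([-1.0, 0.0, 1.0])     # three subjects of increasing ability
# A hard, highly discriminating item vs. an easy, weakly discriminating one.
print(p_correct(abilities, a=2.0, b=1.0))  # steep curve centered at theta = 1
print(p_correct(abilities, a=0.5, b=-1.0)) # shallow curve centered at theta = -1
```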

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

TLDR
A model-based tool to characterize and diagnose datasets is presented; the results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
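The training-dynamics statistics behind such a data map are per-example confidence (mean probability of the gold label across epochs) and variability (its standard deviation); a sketch of the computation from cached per-epoch probabilities, with illustrative thresholds:

```python
import numpy as np

# gold_probs[e, i] = probability the model assigned to example i's gold
# label at the end of epoch e (cached during training).
gold_probs = np.random.rand(6, 1000)      # stand-in: 6 epochs, 1000 examples

confidence = gold_probs.mean(axis=0)      # high + low variability => "easy-to-learn"
variability = gold_probs.std(axis=0)      # high variability => "ambiguous"

# Low confidence + low variability marks "hard-to-learn" (often mislabeled)
# examples; these regions are the diagnostic signal of a data map.
hard = (confidence < 0.3) & (variability < 0.2)
print("hard-to-learn examples:", hard.sum())
```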

Learning Word Vectors for Sentiment Analysis

TLDR
This work presents a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content, and finds that it outperforms several previously introduced methods for sentiment classification.

Abductive Commonsense Reasoning

TLDR
This study introduces a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations, and conceptualizes two new tasks: Abductive NLI (αNLI), a multiple-choice question answering task for choosing the more likely explanation, and Abductive NLG (αNLG), a conditional generation task for explaining given observations in natural language.
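The multiple-choice αNLI task can be approached by scoring each candidate narrative with a language model and choosing the lower-loss hypothesis; a hedged sketch using GPT-2 (one simple baseline, not the paper's method; the story text is illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Score each (observation-1, hypothesis, observation-2) narrative by its
# language-model loss and pick the lower-loss (more plausible) hypothesis.
tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

o1, o2 = "Dotty was very unhappy.", "Dotty felt much better."
hypotheses = ["She bought herself ice cream.", "She lost her wallet."]

def nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()   # mean per-token NLL

scores = [nll(f"{o1} {h} {o2}") for h in hypotheses]
print("more plausible:", hypotheses[scores.index(min(scores))])
```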