What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

@article{Nie2020WhatCW,
  title={What Can We Learn from Collective Human Opinions on Natural Language Inference Data?},
  author={Yixin Nie and Xiang Zhou and Mohit Bansal},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.03532}
}
Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive… 
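A minimal sketch (in Python, not the authors' released code) of how per-example annotation counts of this kind can be turned into a human label distribution and compared against a model's softmax output; ChaosNLI reports distribution distances such as KL and Jensen-Shannon divergence, and the counts below are hypothetical.

import numpy as np

LABELS = ["entailment", "neutral", "contradiction"]

def human_distribution(counts):
    # counts: how many of the 100 annotators chose each label, e.g. [62, 31, 7]
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p) + np.asarray(q))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

human = human_distribution([62, 31, 7])   # hypothetical annotation counts
model = np.array([0.90, 0.08, 0.02])      # a model's softmax output
print(kl_divergence(human, model), js_divergence(human, model))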

Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning

We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation…
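One distribution-estimation technique of the kind explored for this task is Monte Carlo dropout; the sketch below is an illustrative PyTorch stand-in (the classifier and feature dimensions are assumptions, not the paper's model).

import torch
import torch.nn as nn

class TinyNLIClassifier(nn.Module):
    def __init__(self, dim=768, num_labels=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Dropout(p=0.1), nn.Linear(dim, num_labels))
    def forward(self, x):
        return self.net(x)

def mc_dropout_distribution(model, features, num_samples=20):
    # Average the softmax over stochastic forward passes to estimate the
    # distribution of human judgements for each example.
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(features), dim=-1)
                             for _ in range(num_samples)])
    return probs.mean(dim=0)  # shape: (batch, num_labels)

model = TinyNLIClassifier()
features = torch.randn(4, 768)  # stand-ins for sentence-pair encodings
print(mc_dropout_distribution(model, features))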

Capture Human Disagreement Distributions by Calibrated Networks for Natural Language Inference

The overhead of collecting gold ambiguity labels can be cut by instead calibrating the NLI network: as long as the model is calibrated, i.e. its predictive probability reflects the true correctness likelihood, it naturally captures the human ambiguity distribution.
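As an illustration of what "calibrated" means here, the sketch below applies temperature scaling, one standard post-hoc calibration method, to held-out logits; the paper's exact calibration setup may differ.

import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=200):
    # Fit a single temperature T on held-out (logits, gold labels) by
    # minimizing negative log-likelihood; T is parameterized as exp(log_t) > 0.
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

logits = torch.randn(128, 3)           # held-out model logits (stand-ins)
labels = torch.randint(0, 3, (128,))   # held-out majority labels (stand-ins)
T = fit_temperature(logits, labels)
calibrated = torch.softmax(logits / T, dim=-1)  # probabilities read as ambiguity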

Embracing Ambiguity: Shifting the Training Target of NLI Models

This paper prepares AmbiNLI, a trial dataset obtained from readily available sources, and shows it is possible to reduce ChaosNLI divergence scores when finetuning on this data, a promising first step towards learning how to capture linguistic ambiguity.
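A minimal sketch of shifting the training target from a one-hot majority label to the soft distribution of annotator labels, assuming several annotations per example are available; the loss and counts below are illustrative, not AmbiNLI's exact recipe.

import torch
import torch.nn.functional as F

def soft_label_loss(logits, annotation_counts):
    # KL divergence between the model distribution and the normalized
    # annotator label distribution, averaged over the batch.
    target = annotation_counts / annotation_counts.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target, reduction="batchmean")

logits = torch.randn(2, 3, requires_grad=True)
counts = torch.tensor([[3., 1., 1.], [0., 4., 1.]])  # e.g. 5 annotations each
soft_label_loss(logits, counts).backward()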

Learning with Different Amounts of Annotation: From Zero to Many Labels

This work proposes a learning algorithm that can learn from training examples with different amounts of annotation (zero, one, or multiple labels); it efficiently combines signals from uneven training data and brings additional gains in low-annotation-budget and cross-domain settings.
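A minimal sketch of one way such a mixed objective can be set up, handling zero, one, or multiple labels per example; the pseudo-label fallback for unlabeled rows is an illustrative assumption rather than the paper's exact method.

import torch
import torch.nn.functional as F

def mixed_annotation_loss(logits, label_counts):
    # label_counts[i] holds per-class annotation counts; an all-zero row
    # means the example is unlabeled.
    log_probs = F.log_softmax(logits, dim=-1)
    totals = label_counts.sum(dim=-1, keepdim=True)
    soft = torch.where(totals > 0,
                       label_counts / totals.clamp(min=1.0),  # soft/hard targets
                       log_probs.detach().exp())              # self-training target
    return -(soft * log_probs).sum(dim=-1).mean()

logits = torch.randn(3, 3, requires_grad=True)
counts = torch.tensor([[1., 0., 0.],   # single label
                       [2., 3., 0.],   # multiple labels
                       [0., 0., 0.]])  # unlabeled
mixed_annotation_loss(logits, counts).backward()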

The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

This position paper reconciles different previously proposed notions of human label variation, provides a repository of publicly available datasets with un-aggregated labels, depicts approaches proposed so far, identifies gaps, and suggests ways forward.

Investigating Reasons for Disagreement in Natural Language Inference

Two modeling approaches for detecting items with potential disagreement are explored: a 4-way classification with a “Complicated” label in addition to the three standard NLI labels, and a multilabel classification approach that is more expressive and gives better recall of the possible interpretations in the data.
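A minimal sketch of the multilabel formulation, where each item may receive any subset of the three NLI labels, trained with a per-label sigmoid and binary cross-entropy; the details are assumptions, not the paper's exact setup.

import torch
import torch.nn.functional as F

def multilabel_nli_loss(logits, target_sets):
    # target_sets are multi-hot vectors, e.g. [1, 1, 0] for an item plausibly
    # read as either entailment or neutral.
    return F.binary_cross_entropy_with_logits(logits, target_sets)

def predict_label_set(logits, threshold=0.5):
    return (torch.sigmoid(logits) >= threshold).int()

logits = torch.randn(2, 3)
targets = torch.tensor([[1., 1., 0.], [0., 0., 1.]])
print(multilabel_nli_loss(logits, targets), predict_label_set(logits))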

ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds…

Temporal-aware Language Representation Learning From Crowdsourced Labels

TACMA, a temporal-aware language representation learning heuristic for crowdsourced labels with multiple annotators, is proposed and shown to outperform a wide range of state-of-the-art baselines in terms of prediction accuracy and AUC.

Learning from Uneven Training Data: Unlabeled, Single Label, and Multiple Labels

This work proposes a learning algorithm that can learn from uneven training examples (with zero, one, or multiple labels) and achieves consistent gains in both accuracy and label distribution metrics on two tasks, suggesting that training with uneven data can be beneficial for many NLP tasks.

The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

It is found that the performance of state-of-the-art models on Natural Language Inference and Reading Comprehension analysis/stress sets can be highly unstable; both theoretical explanations and empirical evidence regarding the source of this instability are provided.

References

Showing 1-10 of 37 references

Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness

It is shown that evaluations of NLI models can benefit from studying the influence of factors intrinsic to the models or found in the dataset used. Three factors are identified (insensitivity, polarity, and unseen pairs), and their impact on three SNLI models is analyzed under a variety of conditions.

Uncertain Natural Language Inference

The feasibility of collecting annotations for UNLI is demonstrated by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise.
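A minimal sketch of the UNLI-style target: instead of a categorical label, the model regresses a scalar probability that the hypothesis holds given the premise; the encoder and dimensions below are placeholders.

import torch
import torch.nn as nn

class UNLIHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
    def forward(self, pair_encoding):
        # Squash to [0, 1] so the output reads as a subjective probability.
        return torch.sigmoid(self.scorer(pair_encoding)).squeeze(-1)

head = UNLIHead()
encodings = torch.randn(4, 768)                 # stand-in premise-hypothesis encodings
gold = torch.tensor([0.95, 0.40, 0.10, 0.75])   # crowd probability judgements
loss = nn.functional.mse_loss(head(encodings), gold)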

Learning part-of-speech taggers with inter-annotator agreement loss

This paper uses small samples of doubly-annotated part-of-speech data for Twitter to estimate annotation reliability and shows how such metrics of likely inter-annotator agreement can be implemented in the loss functions of POS taggers, finding that cost-sensitive algorithms perform better across annotation projects and even on data annotated according to the same guidelines.
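A minimal sketch of one way agreement estimates could enter a tagger's loss, down-weighting tokens where annotators are likely to disagree; this particular weighting is an illustrative assumption, not the paper's exact cost-sensitive formulation.

import torch
import torch.nn.functional as F

def agreement_weighted_loss(logits, gold_tags, agreement):
    # agreement: per-token probability, estimated from doubly-annotated data,
    # that two annotators would assign the same tag.
    per_token = F.cross_entropy(logits, gold_tags, reduction="none")
    return (agreement * per_token).mean()

logits = torch.randn(5, 12)                         # 5 tokens, 12 POS tags
gold = torch.randint(0, 12, (5,))
agreement = torch.tensor([1.0, 0.9, 0.5, 1.0, 0.7])
print(agreement_weighted_loss(logits, gold, agreement))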

Ordinal Common-sense Inference

This work describes a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task; subsets of previously established datasets are annotated via the ordinal annotation protocol in order to analyze the distinctions between those datasets and the newly constructed one.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.

Inherent Disagreements in Human Textual Inferences

It is argued for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments to reflect the type of uncertainty present in human disagreements.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding their weaknesses.