What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

Yixin Nie, Xiang Zhou, Mohit Bansal

Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive… 
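Evaluation against ChaosNLI-style data compares a model's predicted label distribution with the empirical distribution of the collected annotations, for example via Jensen-Shannon divergence. A minimal sketch of such a comparison (the `human` and `model` vectors are hypothetical illustrations, not values from the dataset):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: 100 annotators label one premise/hypothesis pair as
# (entailment, neutral, contradiction) = (55, 40, 5), giving the human distribution.
human = [0.55, 0.40, 0.05]
model = [0.90, 0.08, 0.02]  # an overconfident model softmax

print(js_divergence(human, model))  # divergence of the model from the human opinions
```

A model that matches the majority label perfectly can still score poorly here if its probability mass ignores the minority opinions.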

Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning

We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation methods, models can better capture the distribution of human judgements.
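One family of distribution estimation methods is ensemble averaging: the predicted distribution is the mean of the softmax outputs of several stochastic predictors. A minimal sketch, where the member outputs are hypothetical placeholders standing in for independently trained models or MC-dropout forward passes:

```python
# Hypothetical member softmax outputs over (entailment, neutral, contradiction);
# in practice each row would come from a separate model or stochastic forward pass.
member_softmaxes = [
    [0.70, 0.25, 0.05],
    [0.45, 0.45, 0.10],
    [0.60, 0.30, 0.10],
]

def average_distribution(distributions):
    """Mean of several categorical distributions, itself a valid distribution."""
    n = len(distributions)
    return [sum(d[i] for d in distributions) / n for i in range(len(distributions[0]))]

pred = average_distribution(member_softmaxes)
print([round(p, 4) for p in pred])  # → [0.5833, 0.3333, 0.0833]
```

The averaged prediction is smoother than any single member, which is the property that lets such estimators track human judgement distributions better than a single softmax.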

Capture Human Disagreement Distributions by Calibrated Networks for Natural Language Inference

The overhead of collecting gold ambiguity labels can be avoided by instead solving the broader problem of calibrating the NLI network: a model naturally captures the human ambiguity distribution as long as it is calibrated, i.e., its predictive probabilities reflect the true correctness likelihood.

Embracing Ambiguity: Shifting the Training Target of NLI Models

This paper constructs AmbiNLI, a trial dataset obtained from readily available sources, and shows that finetuning on this data reduces ChaosNLI divergence scores, a promising first step towards learning how to capture linguistic ambiguity.

Learning with Different Amounts of Annotation: From Zero to Many Labels

This work proposes a learning algorithm that can learn from training examples with different amounts of annotation (zero, one, or multiple labels), efficiently combines signals from unevenly annotated training data, and brings additional gains in low-annotation-budget and cross-domain settings.

Investigating Reasons for Disagreement in Natural Language Inference

Two modeling approaches for detecting items with potential disagreement are explored: a 4-way classification with a “Complicated” label in addition to the three standard NLI labels, and a multilabel classification approach that is more expressive and gives better recall of the possible interpretations in the data.

Curing the SICK and Other NLI Maladies

This work shows that neither the current task formulation nor the proposed uncertainty gradient is entirely suitable for solving the NLI challenges, and proposes an ordered sense space annotation, which distinguishes between logical and common-sense inference.

ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds.

Investigating Multi-source Active Learning for Natural Language Inference

It is revealed that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization, and when outliers are removed, strategies are found to recover and outperform random baselines.

What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

It is found that asking workers to write explanations for their examples is, on its own, an ineffective strategy for boosting NLU example difficulty, and that training crowdworkers, then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments, is an effective means of collecting challenging data.

Temporal-aware Language Representation Learning From Crowdsourced Labels

TACMA, a temporal-aware language representation learning heuristic for crowdsourced labels with multiple annotators, is proposed, and experiments show that it outperforms a wide range of state-of-the-art baselines in prediction accuracy and AUC.

Uncertain Natural Language Inference

The feasibility of collecting annotations for UNLI is demonstrated by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise.

Learning part-of-speech taggers with inter-annotator agreement loss

This paper uses small samples of doubly-annotated part-of-speech data for Twitter to estimate annotation reliability and shows how metrics of likely inter-annotator agreement can be incorporated into the loss functions of POS taggers, finding that cost-sensitive algorithms perform better across annotation projects and even on data annotated according to the same guidelines.
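One simple way to make a loss agreement-sensitive is to down-weight each token's cross-entropy by an estimated inter-annotator agreement score, so the model is penalized less on tokens annotators themselves disagreed on. A minimal sketch (the function name, the toy probabilities, and the agreement scores are hypothetical, not taken from the paper):

```python
import math

def agreement_weighted_nll(probs, gold, agreement):
    """Mean negative log-likelihood where each example's loss is scaled by
    its inter-annotator agreement score (a value in [0, 1])."""
    total = 0.0
    for p, y, a in zip(probs, gold, agreement):
        total += -a * math.log(p[y])
    return total / len(gold)

# Two tokens with predicted tag distributions and gold tag indices.
probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
gold = [0, 1]

print(agreement_weighted_nll(probs, gold, [1.0, 1.0]))  # standard NLL
print(agreement_weighted_nll(probs, gold, [1.0, 0.3]))  # disagreement down-weighted
```

The second call pays less for the token annotators disagreed on, which is the cost-sensitive behaviour the paper's loss functions aim for.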

A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation

The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable, which makes it a unique resource for the study of disagreements on anaphoric interpretation.

Ordinal Common-sense Inference

This work describes a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task, and annotates subsets of previously established datasets via the ordinal annotation protocol in order to analyze the distinctions between those datasets and the newly constructed one.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.

Inherent Disagreements in Human Textual Inferences

It is argued for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments to reflect the type of uncertainty present in human disagreements.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.