Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Mor Geva, Yoav Goldberg, Jonathan Berant
Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. […] Key result: our findings suggest that annotator bias should be monitored during dataset creation, and that test set annotators should be disjoint from training set annotators.
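The paper's recommendation of annotator-disjoint splits can be illustrated with a minimal sketch: partitioning a crowdsourced dataset so that no annotator contributes examples to both the training and test sets. The `annotator_id` field and data layout here are hypothetical, not the paper's actual format.

```python
import random
from collections import defaultdict

def annotator_disjoint_split(examples, test_fraction=0.2, seed=0):
    """Split examples so train and test annotators are disjoint.

    Each example is a dict with a (hypothetical) 'annotator_id' field.
    """
    # Group all examples by the annotator who wrote them.
    by_annotator = defaultdict(list)
    for ex in examples:
        by_annotator[ex["annotator_id"]].append(ex)

    # Shuffle annotators deterministically, then assign whole
    # annotators (not individual examples) to test until the
    # test set reaches roughly the requested fraction.
    annotators = sorted(by_annotator)
    random.Random(seed).shuffle(annotators)

    train, test = [], []
    target = test_fraction * len(examples)
    for ann in annotators:
        bucket = test if len(test) < target else train
        bucket.extend(by_annotator[ann])
    return train, test
```

Because whole annotators are held out, a model cannot exploit annotator-specific writing patterns memorized from training examples at test time.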


Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

This work hypothesizes that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data, and studies this form of bias in 14 recent NLU benchmarks.

The Sensitivity of Annotator Bias to Task Definitions in Argument Mining

This paper presents an annotation experiment that is the first to examine the extent to which social bias is sensitive to how data is annotated, and shows that annotations exhibit widely different levels of group disparity depending on which guidelines annotators follow.

Analyzing the Effects of Annotator Gender across NLP Tasks

This work hypothesizes that gender may correlate with differences in annotations for a number of NLP benchmarks, including those that are fairly subjective and those typically considered to be objective, and develops a robust framework to test for differences in annotation across genders.

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks

It is argued that dataset creators should explicitly aim for one or the other of the descriptive or prescriptive paradigms for data annotation to facilitate the intended use of their dataset.

Toward Annotator Group Bias in Crowdsourcing

It is revealed that annotators within the same demographic group tend to show consistent group bias in annotation tasks; a novel probabilistic graphical framework, GroupAnno, is thus developed to capture annotator group bias with an extended Expectation-Maximization (EM) algorithm.

What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

It is found that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty, and that training crowdworkers and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data.

Evaluating NLP Models via Contrast Sets

A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.

Investigating Annotator Bias with a Graph-Based Approach

This study investigates annotator bias, a form of bias arising from annotators' differing knowledge of the task and their subjective perceptions, by building a graph based on the annotations from the different annotators and applying a community detection algorithm to group the annotators.
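As a rough illustration of this graph-based idea, the sketch below links annotators whose pairwise label agreement exceeds a threshold and then groups them via connected components, a simple stand-in for a full community detection algorithm. The `{annotator: {item: label}}` layout and threshold are assumptions for illustration, not the paper's method.

```python
from itertools import combinations

def agreement(labels_a, labels_b):
    """Fraction of shared items on which two annotators agree."""
    shared = set(labels_a) & set(labels_b)
    if not shared:
        return 0.0
    return sum(labels_a[i] == labels_b[i] for i in shared) / len(shared)

def group_annotators(annotations, threshold=0.8):
    """annotations: {annotator_id: {item_id: label}} (hypothetical layout).

    Returns groups of annotators connected by high pairwise agreement.
    """
    # Build an undirected agreement graph as an adjacency dict.
    adj = {a: set() for a in annotations}
    for a, b in combinations(annotations, 2):
        if agreement(annotations[a], annotations[b]) >= threshold:
            adj[a].add(b)
            adj[b].add(a)

    # Group annotators by connected components (depth-first search).
    groups, seen = [], set()
    for start in annotations:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        groups.append(comp)
    return groups
```

A real study would use a proper community detection algorithm (e.g. modularity-based methods) on a weighted graph, but the pipeline shape is the same: annotations → pairwise agreement → graph → annotator groups.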

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection

This work provides a new dataset of 40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation, and shows that model performance is substantially improved using this approach.

Building Low-Resource NER Models Using Non-Speaker Annotations

This work proposes a complementary approach to building low-resource Named Entity Recognition models using “non-speaker” (NS) annotations, provided by annotators with no prior experience in the target language, and shows that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.

Comparing Bayesian Models of Annotation

Six models of annotation are analyzed, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items, using four datasets with varying degrees of noise in the form of random annotators.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.

Crowdsourcing for NLP

This work introduces crowdsourcing and describes how it is being used in both industry and academia, and introduces different crowdsourcing platforms, review privacy and institutional review board issues, and provides rules of thumb for cost and time estimates.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

NewsQA: A Machine Comprehension Dataset

NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs, is presented and analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment.

Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines

This paper proposes a set of best practice guidelines for crowdsourcing methods for corpus acquisition and introduces GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

Hypothesis Only Baselines in Natural Language Inference

This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
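The hypothesis-only setup can be sketched as a classifier that is handed the premise but deliberately never looks at it; here a tiny unigram Naive Bayes over hypothesis tokens. The toy data, class names, and add-one smoothing are illustrative choices, not the paper's actual models.

```python
import math
from collections import Counter, defaultdict

class HypothesisOnlyNB:
    """Unigram Naive Bayes that sees only the hypothesis, never the premise."""

    def fit(self, pairs, labels):
        # pairs are (premise, hypothesis); the premise is ignored by design.
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for (_, hypothesis), label in zip(pairs, labels):
            self.word_counts[label].update(hypothesis.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, premise, hypothesis):
        # Score each label by log prior + smoothed log likelihoods
        # of the hypothesis tokens alone.
        scores = {}
        total = sum(self.label_counts.values())
        for label, n in self.label_counts.items():
            score = math.log(n / total)
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            for w in hypothesis.lower().split():
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            scores[label] = score
        return max(scores, key=scores.get)
```

If such a premise-blind model beats the majority-class baseline, the dataset contains annotation artifacts: cues in how annotators wrote hypotheses (e.g. negation words for contradiction) rather than genuine inferential signal.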