Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

@article{Geva2019AreWM,
  title={Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets},
  author={Mor Geva and Yoav Goldberg and Jonathan Berant},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.07898}
}
Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. [...] Our findings suggest that annotator bias should be monitored during dataset creation, and that test set annotators should be disjoint from training set annotators.
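As a minimal illustration of the recommendation above, the sketch below keeps train and test annotators disjoint using scikit-learn's GroupShuffleSplit. It assumes a pandas DataFrame with a hypothetical annotator_id column; it is an illustrative sketch, not code from the paper.

```python
# Minimal sketch of an annotator-disjoint train/test split, as recommended in the paper.
# Column names (text, label, annotator_id) are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def annotator_disjoint_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    """Split examples so that no annotator contributes to both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["annotator_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Usage (hypothetical file and columns):
# df = pd.read_csv("crowdsourced_dataset.csv")
# train_df, test_df = annotator_disjoint_split(df)
# assert set(train_df["annotator_id"]).isdisjoint(test_df["annotator_id"])
```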
Citations

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
TLDR
It is argued that dataset creators should explicitly aim for one of the two contrasting data annotation paradigms to facilitate the intended use of their dataset.
Toward Annotator Group Bias in Crowdsourcing
TLDR
It is revealed that annotators within the same demographic group tend to show consistent group bias in annotation tasks; a novel probabilistic graphical framework, GroupAnno, is therefore developed to capture annotator group bias, trained with an extended Expectation Maximization (EM) algorithm.
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
TLDR
It is found that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty and that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data.
Annotation Curricula to Implicitly Train Non-Expert Annotators
TLDR
The results show that using a simple heuristic to order instances can already significantly reduce total annotation time while preserving high annotation quality, and can provide a novel way to improve data collection.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
Investigating Annotator Bias with a Graph-Based Approach
TLDR
This study investigates annotator bias, a form of bias caused by annotators' differing knowledge of the task and their subjective perception, by building a graph from the different annotators' annotations and applying a community detection algorithm to group the annotators.
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
TLDR
It is shown that model performance is substantially improved using this approach, and models trained on later rounds of data collection perform better on test sets and are harder for annotators to trick.
A guide to the dataset explosion in QA, NLI, and commonsense reasoning
TLDR
This tutorial aims to provide an up-to-date guide to the recent datasets, survey the old and new methodological issues with dataset construction, and outline the existing proposals for overcoming them.
Ground-Truth, Whose Truth? - Examining the Challenges with Annotating Toxic Text Datasets
TLDR
The authors re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help improve dataset quality and capture dependence on context and diversity among annotators.
Simple but effective techniques to reduce biases
TLDR
This work introduces an additional lightweight bias-only model which learns dataset biases and uses its prediction to adjust the loss of the base model to reduce the biases.
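To make the bias-only idea in the entry above concrete, here is a hedged sketch of one common instantiation: a product-of-experts combination in PyTorch, where the detached log-probabilities of a lightweight bias-only model adjust the loss of the base model. The names and the exact combination are assumptions, not necessarily the cited paper's formulation.

```python
# Hedged sketch: combine base-model and bias-only log-probabilities (product of experts)
# and compute cross-entropy on the combination. Gradients flow only to the base model.
import torch
import torch.nn.functional as F

def debiased_loss(main_logits: torch.Tensor, bias_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on the combined distribution; the bias-only model is detached."""
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits.detach(), dim=-1)
    return F.cross_entropy(combined, labels)

# During training, bias_logits come from a weak model that sees only the biased features
# (e.g. the hypothesis alone in NLI); main_logits come from the full base model.
```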

References

SHOWING 1-10 OF 29 REFERENCES
Comparing Bayesian Models of Annotation
TLDR
Six models of annotation are analyzed, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items, using four datasets with varying degrees of noise in the form of random annotators.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI examples, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Crowdsourcing for NLP
TLDR
This work introduces crowdsourcing and describes how it is being used in both industry and academia; it introduces different crowdsourcing platforms, reviews privacy and institutional review board issues, and provides rules of thumb for cost and time estimates.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
TLDR
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
A large annotated corpus for learning natural language inference
TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
NewsQA: A Machine Comprehension Dataset
TLDR
NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs, is presented and analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment.
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
TLDR
This paper proposes a set of best practice guidelines for crowdsourcing methods for corpus acquisition and introduces GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
Hypothesis Only Baselines in Natural Language Inference
TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
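A hypothesis-only baseline of the kind described in the two entries above can be sketched in a few lines. The pipeline below (scikit-learn, with illustrative variable names) trains a bag-of-words classifier on hypotheses alone; if it beats the majority-class baseline on the test split, the dataset likely contains annotation artifacts that leak label information into the hypothesis.

```python
# Minimal hypothesis-only baseline sketch: a bag-of-words classifier trained on the
# hypothesis text alone, with no access to the premise. Variable names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_hypothesis_only(train_hypotheses, train_labels):
    """Fit a TF-IDF + logistic regression model on hypotheses only."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_hypotheses, train_labels)
    return model

# Usage (hypothetical lists of strings and labels):
# model = train_hypothesis_only(hypotheses_train, labels_train)
# accuracy = model.score(hypotheses_test, labels_test)
```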