Corpus ID: 56487787

A Case for a Range of Acceptable Annotations

Jennimaria Palomaki, Olivia Rhinehart, Michael Tseng
Multi-way annotation is often used to ensure data quality in crowdsourced annotation tasks. Each item is annotated redundantly and the contributors’ judgments are converted into a single “ground truth” label or more complex annotation through a resolution technique (e.g., on the basis of majority or plurality). Recent crowdsourcing research has argued against the notion of a single “ground truth” annotation for items in semantically oriented tasks—that is, we should accept the aggregated… 


The Extraordinary Failure of Complement Coercion Crowdsourcing
It is concluded that specific phenomena require tailored solutions, not only in specialized algorithms, but also in data collection methods.
Inherent Disagreements in Human Textual Inferences
It is argued for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments to reflect the type of uncertainty present in human disagreements.
“I’ll be there for you”: The One with Understanding Indirect Answers
This paper introduces a new English corpus to study the problem of understanding indirect answers, and presents a set of experiments in which Convolutional Neural Networks are evaluated for this task, including cross-dataset evaluation and experiments with learning from disagreements in annotation.
Interrater Disagreement Resolution: A Systematic Procedure to Reach Consensus in Annotation Tasks
We present a systematic procedure for interrater disagreement resolution. The procedure is general, but of particular use in multiple-annotator tasks geared towards ground truth construction…
Annotation Difficulties in Natural Language Inference
An experiment based on a small subset of the NLI corpora reveals that some inference cases are inherently harder to annotate than others, although good-quality guidelines can reduce this difficulty to some extent.
Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application
A general method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory, leading to a new form of model interpretation because each continuous prediction can be directly explained by the constituent components in the penultimate layer.


Crowd Truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard
A new type of ground truth is proposed, a crowd truth, which is richer in diversity of perspectives and interpretations, and reflects more realistic human knowledge, and a framework to exploit such diverse human responses to annotation tasks for analyzing and understanding disagreement is proposed.
Parting Crowds: Characterizing Divergent Interpretations in Crowdsourced Annotation Tasks
This paper introduces the workflow design pattern of crowd parting: separating workers based on shared patterns in responses to a crowdsourcing task, and illustrates this idea using an automated clustering-based method to identify divergent, but valid, worker interpretations in crowdsourced entity annotations collected over two distinct corpora.
Crowdsourcing Disagreement for Collecting Semantic Annotation
This paper proposes an approach to gathering semantic annotation which rejects the notion that human interpretation can have a single ground truth, and is instead based on the observation that…
Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation
A new theory of truth, crowd truth, is proposed that is based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.
Measuring Crowd Truth for Medical Relation Extraction
This paper presents a framework for continuously gathering, analyzing and understanding large amounts of gold standard annotation disagreement data, and discusses the experimental results demonstrating that there is useful information in human disagreement on annotation tasks.
The VU Sound Corpus: Adding More Fine-grained Annotations to the Freesound Database
A collection of annotations for a set of 2,133 environmental sounds taken from the Freesound database is presented, finding that it is not only feasible to perform crowd-labeling for a large collection of sounds, but it is also very useful to highlight different aspects of the sounds that authors may fail to mention.
Linguistic Wisdom from the Crowd
Two approaches to linguistic data collection corresponding to these differing goals (model-driven and user-driven) are defined and exemplified and some hybrid cases in which they overlap are discussed.
New Insights from Coarse Word Sense Disambiguation in the Crowd
Surprising features which drive differential WSD accuracy are found: the number of rephrasings within a sense definition is associated with higher accuracy, and as word frequency increases, accuracy decreases even if the number of senses is kept constant.
Crowdsourcing user studies with Mechanical Turk
Although micro-task markets have great potential for rapidly collecting user measurements at low costs, it is found that special care is needed in formulating tasks in order to harness the capabilities of the approach.
The future of crowd work
This paper outlines a framework that will enable crowd work that is complex, collaborative, and sustainable, and lays out research challenges in twelve major areas: workflow, task assignment, hierarchy, real-time response, synchronous collaboration, quality control, crowds guiding AIs, AIs guiding crowds, platforms, job design, reputation, and motivation.