What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

@article{Nangia2021WhatIM,
  title={What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?},
  author={Nikita Nangia and Saku Sugawara and H. Trivedi and Alex Warstadt and Clara Vania and Sam Bowman},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.00794}
}
Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by… 

Citations

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
TLDR
This work introduces a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans in order to curate NLP datasets of enhanced quality and diversity.
LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework
TLDR
This work proposes LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers, and shows that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios.
Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
TLDR
This work hypothesizes that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write similar examples that are then over-represented in the collected data. It studies this form of bias, termed instruction bias, in 14 recent NLU benchmarks and shows that instruction examples often exhibit concrete patterns.
ANLIzing the Adversarial Natural Language Inference Dataset
We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds.
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
TLDR
This work introduces a novel method for efficient dataset curation: a large language model is used to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task.
WebQA: Multihop and Multimodal QA
TLDR
This work introduces WEBQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models that lack language-groundable visual representations for novel objects and the ability to reason, yet is trivial for humans.
Analyzing Dynamic Adversarial Training Data in the Limit
TLDR
This paper presents the first study of longer-term DADC, where 20 rounds of NLI examples for a small set of premise paragraphs are collected, with both adversarial and non-adversarial approaches.
"I'm sorry to hear that": finding bias in language models with a holistic descriptor dataset
TLDR
This work presents a new, more inclusive dataset, HolisticBias, which consists of nearly 600 descriptor terms across 13 different demographic axes, and demonstrates that this dataset is highly efficacious for measuring previously unmeasurable biases in token likelihoods and generations from language models, as well as in an offensiveness classifier.
QuALITY: Question Answering with Long Input Texts, Yes!
TLDR
QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process, is introduced to enable building and testing models on long-document comprehension.

References

Showing 1-10 of 67 references
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
TLDR
It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.
Perspectives on crowdsourcing annotations for natural language processing
TLDR
A faceted analysis of crowdsourcing from a practitioner’s perspective is provided, and how the major crowdsourcing genres fill different parts of this multi-dimensional space is summarized.
Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research
TLDR
This research, which explores the prevalence of dishonesty among crowdworkers, how workers respond to both monetary incentives and intrinsic forms of motivation, and how crowdworkers interact with each other, has immediate implications that are distilled into best practices researchers should follow when using crowdsourcing in their own research.
MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy
TLDR
This paper presents a new quality-control workflow that requires some workers to Justify their reasoning and asks others to Reconsider their decisions after reading counter-arguments from workers with opposing views, which produces much higher accuracy than simpler voting approaches for a range of budgets.
What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
TLDR
This work collects ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets, and argues for a detailed examination of human agreement in future data collection efforts, and evaluating model outputs against the distribution over collective human opinions.
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria
TLDR
An empirical study is conducted to examine the effect of noisy annotations on the performance of sentiment classification models and to evaluate the utility of annotation selection for classification accuracy and efficiency.
Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
TLDR
This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
Using Worker Self-Assessments for Competence-Based Pre-Selection in Crowdsourcing Microtasks
TLDR
The results show that requesters in crowdsourcing platforms can benefit by considering worker self-assessments in addition to their performance for pre-selection, and make a case for competence-based pre-selection in crowdsourced marketplaces.