Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

@inproceedings{Snow2008CheapAF,
  title={Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks},
  author={Rion Snow and Brendan T. O'Connor and Dan Jurafsky and Andrew Y. Ng},
  booktitle={EMNLP},
  year={2008}
}
Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement… 

Citations

Using the crowd for readability prediction
TLDR
It is concluded that readability assessment by comparing texts is a polyvalent methodology, which can be adapted to specific domains and target audiences if required.
Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation
TLDR
Using MTurk to collect annotations for Subjectivity Word Sense Disambiguation (SWSD), a coarse-grained word sense disambiguation task, is investigated, suggesting a greater role for MTurk in constructing a large-scale SWSD system in the future and promising substantial improvements in subjectivity and sentiment analysis.
Facilitating Corpus Annotation by Improving Annotation Aggregation
TLDR
CSLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks, is presented, along with a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks.
Anveshan: A Framework for Analysis of Multiple Annotators’ Labeling Behavior
TLDR
This work introduces a framework, "Anveshan," in which annotator behavior is investigated to find outliers, cluster annotators by behavior, and identify confusable labels; it shows that trained annotators are superior to a larger number of untrained annotators for this task.
A Methodology for Using Professional Knowledge in Corpus
TLDR
This dissertation aims to find a way to capture expert domain knowledge quickly and easily as annotations, in a format where the information can then be used for more advanced natural language processing (NLP) tasks.
Collecting Image Annotations Using Amazon’s Mechanical Turk
TLDR
It is found that the use of a qualification test provides the greatest improvement in quality, whereas refining the annotations through follow-up tasks works rather poorly.
Crowdsourcing Annotation for Machine Learning in Natural Language Processing Tasks
TLDR
A set of features that help distinguish well-formed translations from those that are not is discussed, and it is shown that crowdsourcing yields high-quality translations at a fraction of the cost of hiring professionals.
Scaling Semantic Frame Annotation
TLDR
It is shown that non-experts can be trained to perform accurate frame disambiguation and can even identify errors in the gold data used as training exemplars, demonstrating the efficacy of this paradigm for semantic annotation that requires an intermediate level of expertise.
Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk
TLDR
It is found that, when combined, non-expert judgments have a high level of agreement with the existing gold-standard judgments of machine translation quality and correlate more strongly with expert judgments than BLEU does; Mechanical Turk can also be used to calculate human-mediated translation edit rate (HTER), to conduct reading comprehension experiments with machine translation, and to create high-quality reference translations.
Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
TLDR
An experiment is presented in which crowdsourcing methods employing native speakers generate a list of coarse-grained senses under a common multilingual semantic taxonomy for sets of words in six languages.
...
...

References

Showing 1-10 of 44 references
Scaling to Very Very Large Corpora for Natural Language Disambiguation
TLDR
This paper examines methods for effectively exploiting very large corpora when labeled data comes at a cost, and evaluates the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation.
Creating a Research Collection of Question Answer Sentence Pairs with Amazon's Mechanical Turk
TLDR
The Question-Answer Sentence Pairs (QASP) corpus is introduced and it is believed that this corpus can further stimulate research in QA, especially linguistically motivated research, where matching the question to the answer sentence by either syntactic or semantic means is a central concern.
Get another label? improving data quality and data mining using multiple, noisy labelers
TLDR
The results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
Automatic Extraction of Useful Facet Hierarchies from Text Databases
TLDR
This paper presents an unsupervised technique for automatic extraction of facets useful for browsing text databases, and shows that its techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.
Building a Sense Tagged Corpus with Open Mind Word Expert
TLDR
A Senseval-3 lexical sample activity in which the training data is collected via Open Mind Word Expert is described; the collection process can be extended to create a definitive corpus of word sense information.
SWAT-MP:The SemEval-2007 Systems for Task 5 and Task 14
TLDR
Two SemEval-2007 entries are described: a supervised system that decides the most appropriate English translation of a Chinese target word, and a system that annotates headlines using a predefined list of emotions.
Overview of the TREC 2002 Question Answering Track
TLDR
This paper provides an overview of the TREC 2002 QA track, which defined how answer strings were judged and established that different assessors have different ideas as to what constitutes a correct answer, even for the limited type of questions used in the track.
The Proposition Bank: An Annotated Corpus of Semantic Roles
TLDR
An automatic system for semantic role tagging trained on the corpus is described and the effect on its performance of various types of information is discussed, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty trace categories of the treebank.
Utility data annotation with Amazon Mechanical Turk
  • A. Sorokin and D. Forsyth. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008.
TLDR
This work shows how to outsource data annotation to Amazon Mechanical Turk, and describes results for several different annotation problems, including some strategies for determining when the task is well specified and properly priced.
Building a Large Annotated Corpus of English: The Penn Treebank
TLDR
As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
...
...