# The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

@article{Karpinska2021ThePO,
title={The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation},
author={Marzena Karpinska and Nader Akoury and Mohit Iyyer},
journal={ArXiv},
year={2021},
volume={abs/2109.06835}
}
• Published 14 September 2021
• Computer Science
• ArXiv
Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the…

## Citations

• Computer Science
• 2022
FALTE is introduced, a web-based annotation toolkit designed to streamline human rating and error analysis for the evaluation of long text generation; it allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task.
• Computer Science
• 2021
GENIE is introduced, a system for running standardized human evaluations across different generation tasks, instantiated with datasets representing four core challenges in text generation (machine translation, summarization, commonsense reasoning, and machine comprehension), along with an automated mechanism for maintaining annotator quality via a probabilistic model.
• Computer Science
NeurIPS
• 2021
MAUVE is introduced, a comparison measure for open-ended text generation that directly compares the learned distribution of a text generation model to the distribution of human-written text using divergence frontiers, and that scales to modern text generation models by computing information divergences in a quantized embedding space.
• Computer Science
ArXiv
• 2023
A large multilingual language model is applied in open-ended generation of Chinese song lyrics, and the resulting lyrics are evaluated for coherence and creativity using human reviewers, demonstrating a creative approach for an artist seeking inspiration for an album or single.
• Computer Science
Computational Linguistics and Intellectual Technologies
• 2022
This work presents the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022, and provides count-based and BERT-based baselines, along with the human evaluation on the first sub-task.
• Computer Science
ArXiv
• 2022
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps to improve evaluation processes.
• Psychology
ArXiv
• 2022
Human communication is increasingly intermixed with language generated by AI. Across chat, email, and social media, AI systems produce smart replies, autocompletes, and translations. AI-generated
• Computer Science
• 2022
This tutorial aims to raise awareness of artificial text detection, a fast-growing niche field devoted to mitigating the misuse of generative models, by outlining unresolved methodological problems and directions for future work.
• Computer Science
COLING
• 2022
This paper introduces a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature, and presents HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems, to quantitatively evaluate the correlations of 72 automatic metrics with human criteria.
• Computer Science
NAACL
• 2022
This work defines the problem of constrained long-term control for dialogue generation, identifies gaps in current evaluation methods, and proposes new metrics that better measure long-term control; the proposed approach outperforms state-of-the-art constrained generation baselines.

## References

Showing 1–10 of 77 references

• Computer Science
EMNLP
• 2019
A large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews finds lexical diversity an intriguing metric that is indicative of the assessments of different evaluators.
• Psychology
ACL
• 2021
The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT-3-authored text are explored; it is found that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.
• Computer Science
• 2021
GENIE is introduced, a system for running standardized human evaluations across different generation tasks, instantiated with datasets representing four core challenges in text generation (machine translation, summarization, commonsense reasoning, and machine comprehension), along with an automated mechanism for maintaining annotator quality via a probabilistic model.
• Computer Science
CoNLL
• 2019
It is found that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.
• Computer Science
EMNLP
• 2020
This paper uses the recently introduced entmax transformation to train and sample from a natively sparse language model, avoiding the mismatch between training and testing conditions, and proposes three new metrics for comparing sparse or truncated distributions: ε-perplexity, sparsemax score, and Jensen–Shannon divergence.
• Computer Science
Natural Language Engineering
• 2015
A new methodology for crowd-sourcing human assessments of translation quality is presented, which allows individual workers to develop their own individual assessment strategy and has a substantially increased ability to identify significant differences between translation systems.
• Computer Science
EMNLP
• 2020
MEGATRON-CNTRL is a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base and showcases the controllability of the model by replacing the keywords used to generate stories and re-running the generation process.
• Computer Science
CL
• 2011
This paper defines precisely what MTurk is and what it is not, in the hope of pointing out opportunities for the community to deliberately value ethics above cost savings.
• Computer Science
ACL
• 2018
This work collects a large dataset of 300K human-written stories paired with writing prompts from an online forum that enables hierarchical story generation, where the model first generates a premise, and then transforms it into a passage of text.
• Computer Science
EMNLP
• 2020
This paper proposes a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit into their literal counterparts using structured common-sense knowledge, and fine-tunes a pretrained sequence-to-sequence model, BART, on the literal–simile pairs to gain generalizability.