The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

@article{Karpinska2021ThePO,
  title={The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation},
  author={Marzena Karpinska and Nader Akoury and Mohit Iyyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.06835}
}
Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the… 
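
Since the paper centers on Likert-scale judgments collected on AMT, here is a minimal sketch, with entirely invented ratings, of how such crowdsourced scores are commonly aggregated and sanity-checked for annotator consistency (mean pairwise Spearman correlation is only a rough proxy for formal agreement measures):

import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

# rows = generated texts, columns = annotators, values = 1-5 Likert scores (invented)
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 4, 4],
    [1, 2, 3],
    [3, 3, 4],
])

# Per-item quality estimate: mean over annotators (a common, if lossy, aggregation).
print("mean Likert score per item:", ratings.mean(axis=1))

# Rough consistency check: average pairwise Spearman correlation between annotators.
pair_corrs = []
for i, j in combinations(range(ratings.shape[1]), 2):
    rho, _ = spearmanr(ratings[:, i], ratings[:, j])
    pair_corrs.append(rho)
print("mean pairwise Spearman correlation:", float(np.mean(pair_corrs)))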

FALTE: A Toolkit for Fine-grained Annotation for Long Text Evaluation

FALTE, a web-based annotation toolkit designed to streamline human rating and error analysis for the evaluation of long text generation, is introduced; it allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task.

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

GENIE is introduced: a system for running standardized human evaluations across different generation tasks, instantiated with datasets representing four core challenges in text generation (machine translation, summarization, commonsense reasoning, and machine comprehension), together with an automated mechanism for maintaining annotator quality via a probabilistic model.

MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

MAUVE is introduced, a comparison measure for open-ended text generation that directly compares the learned distribution from a text generation model to the distribution of human-written text using divergence frontiers, and that scales up to modern text generation models by computing information divergences in a quantized embedding space.
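
The TLDR above compresses the method heavily; the following is a rough, self-contained sketch of the quantize-then-compare idea as I read it. Random vectors stand in for real text embeddings, and this is not the authors' official implementation (released, to my knowledge, as the mauve-text package):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
human_emb = rng.normal(0.0, 1.0, size=(500, 32))  # placeholder for embeddings of human text
model_emb = rng.normal(0.3, 1.1, size=(500, 32))  # placeholder for embeddings of model text

# 1) Quantize the joint embedding space into k clusters.
k = 20
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack([human_emb, model_emb]))

def cluster_histogram(emb):
    counts = np.bincount(km.predict(emb), minlength=k).astype(float) + 1e-6  # smooth zeros
    return counts / counts.sum()

p, q = cluster_histogram(human_emb), cluster_histogram(model_emb)  # human vs. model

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# 2) Trace a divergence frontier over mixtures r = lam*p + (1-lam)*q and summarize
#    it with the area under the resulting curve (closer distributions -> larger area).
points = []
for lam in np.linspace(0.01, 0.99, 25):
    r = lam * p + (1 - lam) * q
    points.append((np.exp(-kl(q, r)), np.exp(-kl(p, r))))
points.sort()
xs = np.array([x for x, _ in points])
ys = np.array([y for _, y in points])
area = float(np.sum(np.diff(xs) * (ys[:-1] + ys[1:]) / 2))  # trapezoidal rule
print("divergence-frontier area (MAUVE-like score):", round(area, 3))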

In BLOOM: Creativity and Affinity in Artificial Lyrics and Art

A large multilingual language model is applied to the open-ended generation of Chinese song lyrics, and the resulting lyrics are evaluated for coherence and creativity by human reviewers, demonstrating a creative approach for an artist seeking inspiration for an album or single.

Findings of the RuATD Shared Task 2022 on Artificial Text Detection in Russian

This work presents the shared task on artificial text detection in Russian, organized as part of the Dialogue Evaluation initiative held in 2022, and provides count-based and BERT-based baselines along with a human evaluation on the first sub-task.

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps for researchers to improve their evaluation processes.

Human Heuristics for AI-Generated Language Are Flawed

Human communication is increasingly intermixed with language generated by AI. Across chat, email, and social media, AI systems produce smart replies, autocompletes, and translations. AI-generated …

Tutorial on Artificial Text Detection

This tutorial aims at raising awareness of artificial text detection, a fast-growing niche field devoted to mitigating the misuse of generative models, by outlining unresolved methodological problems and future work directions.

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

This paper introduces a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature, and presents HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems, to quantitatively evaluate the correlations of 72 automatic metrics with human criteria.
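
As an illustration of the kind of analysis such a benchmark supports, here is a minimal sketch, with invented numbers, of correlating one automatic metric against one human criterion:

import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

human_coherence = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 1.7])      # hypothetical mean human ratings
metric_scores = np.array([0.71, 0.55, 0.40, 0.80, 0.62, 0.33])  # hypothetical metric outputs

for name, corr_fn in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, pval = corr_fn(metric_scores, human_coherence)
    print(f"{name}: r={stat:.3f} (p={pval:.3f})")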

Long-term Control for Dialogue Generation: Methods and Evaluation

This work defines the problem of constrained long-term control for dialogue generation, identifies gaps in current evaluation methods, and proposes new metrics that better measure long-term control, along with an approach that outperforms state-of-the-art constrained generation baselines.

References

Showing 1-10 of 77 references

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

A large-scale, systematic study evaluates existing evaluation methods for natural language generation in the context of generating online product reviews and finds lexical diversity to be an intriguing metric that is indicative of the assessments of different evaluators.

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text

The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT-3-authored text are explored; while evaluation accuracy improved to up to 55%, it did not improve significantly across the three domains.
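
For context, a chance-level comparison like the one implied above can be checked with a simple binomial test; the counts below are invented purely for illustration:

from scipy.stats import binomtest

correct, total = 110, 200  # hypothetical: evaluators correct on 55% of human-vs-machine judgments
result = binomtest(correct, total, p=0.5, alternative="greater")
print(f"accuracy = {correct / total:.1%}, p-value vs. 50% chance = {result.pvalue:.4f}")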

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

GENIE is introduced: a system for running standardized human evaluations across different generation tasks, instantiated with datasets representing four core challenges in text generation (machine translation, summarization, commonsense reasoning, and machine comprehension), together with an automated mechanism for maintaining annotator quality via a probabilistic model.

Do Massively Pretrained Language Models Make Better Storytellers?

It is found that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.

Sparse Text Generation

This paper uses the recently introduced entmax transformation to train and sample from a natively sparse language model, avoiding the mismatch between training and testing conditions, and proposes three new metrics for comparing sparse or truncated distributions: ε-perplexity, sparsemax score, and Jensen-Shannon divergence.
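
A rough sketch of the smoothing idea behind ε-perplexity as I understand it (sparse decoders can assign exactly zero probability to some gold tokens, which would make ordinary perplexity infinite); the exact formulation in the paper may differ:

import numpy as np

def epsilon_perplexity(gold_token_probs, vocab_size, eps=1e-4):
    """Perplexity after add-epsilon smoothing, so zero-probability tokens stay finite."""
    probs = np.asarray(gold_token_probs, dtype=float)
    smoothed = (probs + eps) / (1.0 + eps * vocab_size)  # renormalize over the vocabulary
    return float(np.exp(-np.mean(np.log(smoothed))))

# Hypothetical probabilities a sparse model assigned to the gold tokens (note the exact zero).
print(epsilon_perplexity([0.30, 0.05, 0.0, 0.62, 0.11], vocab_size=50_000))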

Can machine translation systems be evaluated by the crowd alone

A new methodology for crowd-sourcing human assessments of translation quality is presented that allows individual workers to develop their own assessment strategy and has a substantially increased ability to identify significant differences between translation systems.
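
The "individual assessment strategy" point rests on standardizing each worker's raw scores; a minimal sketch of that per-worker z-scoring, on invented judgments, looks roughly like this:

import numpy as np

# (worker_id, system_id, raw 0-100 adequacy score) triples from a hypothetical HIT batch.
judgments = [
    ("w1", "sysA", 80), ("w1", "sysB", 60), ("w1", "sysA", 90),
    ("w2", "sysA", 55), ("w2", "sysB", 30), ("w2", "sysB", 40),
]

# Compute each worker's mean and standard deviation over their own scores.
by_worker = {}
for worker, _, score in judgments:
    by_worker.setdefault(worker, []).append(score)
worker_stats = {}
for worker, scores in by_worker.items():
    sd = np.std(scores)
    worker_stats[worker] = (np.mean(scores), sd if sd > 0 else 1.0)  # guard zero variance

# Convert every raw score to a z-score within its worker, then average per system.
z_by_system = {}
for worker, system, score in judgments:
    mu, sd = worker_stats[worker]
    z_by_system.setdefault(system, []).append((score - mu) / sd)
for system, zs in sorted(z_by_system.items()):
    print(f"{system}: mean standardized score = {np.mean(zs):+.2f}")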

Controllable Story Generation with External Knowledge Using Large-Scale Language Models

MEGATRON-CNTRL is a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base; the controllability of the model is showcased by replacing the keywords used to generate stories and re-running the generation process.

Last Words: Amazon Mechanical Turk: Gold Mine or Coal Mine?

By defining precisely what MTurk is and what it is not, this piece aims to point out opportunities for the community to deliberately value ethics above cost savings.

Hierarchical Neural Story Generation

This work collects a large dataset of 300K human-written stories paired with writing prompts from an online forum and proposes hierarchical story generation, where the model first generates a premise and then transforms it into a passage of text.

Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation

This paper proposes a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit into their literal counterparts using structured common-sense knowledge, and fine-tunes a pretrained sequence-to-sequence model, BART, on the literal-simile pairs to gain generalizability.
...