All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text

@inproceedings{Clark2021AllT,
  title={All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text},
  author={Elizabeth Clark and Tal August and Sofia Serrano and Nikita Haduong and Suchin Gururangan and Noah A. Smith},
  booktitle={ACL},
  year={2021}
}
Human evaluations are typically considered the gold standard in natural language generation, but as models’ fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts’ ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore… 
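
The chance-level result can be framed as a simple binomial question: does the evaluators' pooled accuracy at guessing the author differ significantly from 50%? A minimal sketch of that check, with placeholder counts rather than the paper's actual data and assuming scipy is available:

# Test whether pooled evaluator accuracy differs from the 50% chance rate.
# The counts below are placeholders, not the study's reported numbers.
from scipy.stats import binomtest

n_judgments = 780   # placeholder: total human-vs-machine judgments collected
n_correct = 402     # placeholder: judgments that identified the author correctly

result = binomtest(n_correct, n_judgments, p=0.5)  # two-sided exact test
print(f"accuracy = {n_correct / n_judgments:.3f}, p vs. chance = {result.pvalue:.3f}")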

Citations

How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory
TLDR
This work identifies the implicit assumptions the standard human-evaluation protocol makes about annotators and suggests improvements to make it more theoretically sound, but finds that even in its improved form the protocol cannot be used to evaluate open-ended tasks like story generation.
Understanding Human Potentials for Evaluating Generative Models
TLDR
Focusing on natural language generation, this work proposes a method to dynamically determine how many human annotations are required when evaluating models in a relative comparison setting, ensuring enough labelling to decide, with high probability, which of two generative models is better.
Dynamic Human Evaluation for Relative Model Comparisons
TLDR
This work proposes an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding the better model, in both a simulation and a crowdsourcing case study; results indicate that a decision about the superior model can be made with high probability across different labelling strategies.
Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
TLDR
This work proposes a new framework called Scarecrow for scrutinizing machine text via crowd annotation, and quantifies measurable gaps between human-authored text and generations from models of several sizes, including fourteen configurations of GPT-3.
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
TLDR
It is shown that AMT worker judgments improve when workers are shown model-generated output alongside human-generated references, which enables them to better calibrate their ratings; interviews with English teachers provide deeper insights into the challenges of the evaluation process.
Learning to Rank Visual Stories From Human Ranking Data
TLDR
This paper develops Vrank (VIST Ranker), a novel reference-free VIST metric for story evaluation, demonstrates its generalizability to purely textual stories, and concludes that this reuse of human evaluation results puts Vrank in a strong position for continued future advances.
Evaluation of Interest and Coherence in Machine Generated Stories
TLDR
The results show that human evaluation results vary in comparison to automated metrics, suggesting further work is required to train automated metrics to identify text that humans find interesting.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
TLDR
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps for researchers to improve their evaluation processes.
Transparent Human Evaluation for Image Captioning
TLDR
THumB, a rubric-based human evaluation protocol for image captioning models, is established and results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall.
Pre-trained language models evaluating themselves - A comparative study
TLDR
This work examines the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate and finds that no metric showed appropriate behaviour for negation, and further that none of them was consistently sensitive to the other error phenomena examined.
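
One way to probe the negation insensitivity reported above is to score a reference against a faithful paraphrase and against a negated candidate and compare the metric's outputs; an insensitive metric assigns the two candidates similar scores. A rough sketch for BERTScore alone, assuming the bert_score package and invented example sentences:

# Probe negation sensitivity for BERTScore: a metric that handles negation
# should score the negated candidate clearly lower than the paraphrase.
from bert_score import score

reference  = ["The restaurant was clean and the staff were friendly."]
paraphrase = ["The restaurant was tidy and the staff were welcoming."]
negated    = ["The restaurant was not clean and the staff were unfriendly."]

_, _, f1_para = score(paraphrase, reference, lang="en")
_, _, f1_neg  = score(negated, reference, lang="en")
print(f"F1 paraphrase: {f1_para.item():.3f}, F1 negated: {f1_neg.item():.3f}")
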
...

References

Showing 1-10 of 42 references
Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation
TLDR
A large-scale, systematic study to evaluate existing evaluation methods for natural language generation in the context of generating online product reviews, which finds lexical diversity to be an intriguing metric that is indicative of the assessments of different evaluators.
Comparing Automatic and Human Evaluation of NLG Systems
TLDR
It is found that NIST scores correlate best with human judgments, but that all automatic metrics the authors examined are biased in favour of generators that select on the basis of frequency alone.
Unifying Human and Statistical Evaluation for Natural Language Generation
TLDR
This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.
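
HUSE's central quantity is the optimal error rate of a classifier that separates human-written from machine-generated sentences using two features per sentence: a human judgment score and the model's length-normalized log-probability. The sketch below approximates that error rate with a leave-one-out nearest-neighbour classifier over made-up feature arrays; it is an illustrative reconstruction, not the authors' released implementation.

# Approximate the optimal human/machine classification error behind HUSE
# with leave-one-out k-NN over two features per sentence:
# a human judgment score and the model's length-normalized log-probability.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 100  # placeholder: 100 human-written and 100 machine-generated sentences

# Made-up features: column 0 = human judgment, column 1 = normalized log-prob.
human_feats = rng.normal(loc=[4.0, -2.0], scale=[0.8, 0.5], size=(n, 2))
model_feats = rng.normal(loc=[3.2, -1.5], scale=[0.8, 0.5], size=(n, 2))

X = np.vstack([human_feats, model_feats])
y = np.array([1] * n + [0] * n)  # 1 = human-written, 0 = machine-generated

knn = KNeighborsClassifier(n_neighbors=15)
error_rate = 1.0 - cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
print(f"estimated optimal error: {error_rate:.3f}")  # near 0.5 means indistinguishable
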
Best practices for the human evaluation of automatically generated text
TLDR
This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature, for Natural Language Generation systems.
RankME: Reliable Human Ratings for Natural Language Generation
TLDR
This work presents a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments, and shows that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods.
Rethinking the Agreement in Human Evaluation Tasks
TLDR
This paper examines how annotators diverge in language annotation tasks due to a range of ineliminable factors and suggests a new approach to the use of agreement metrics in natural language generation evaluation tasks.
“This is a Problem, Don’t You Agree?” Framing and Bias in Human Evaluation for Natural Language Generation
Despite recent efforts reviewing current human evaluation practices for natural language generation (NLG) research, the lack of reported question wording and potential for framing effects or…
Towards Best Experiment Design for Evaluating Dialogue System Output
TLDR
Through a systematic study with 40 crowdsourced workers per task, it is found that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs, and that factors such as the time taken to complete the task and having no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters.
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
TLDR
Due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and the field is in urgent need of standard methods and terminology.
...