Corpus ID: 227216975

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

@inproceedings{Howcroft2020TwentyYO,
  title={Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions},
  author={David M. Howcroft and Anya Belz and Miruna Clinciu and Dimitra Gkatzia and Sadid A. Hasan and Saad Mahamood and Simon Mille and Emiel van Miltenburg and Sashank Santhanam and Verena Rieser},
  booktitle={INLG},
  year={2020}
}
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii… 
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
TLDR
This work proposes a classification system for evaluations based on disentangling what is being evaluated and how it is evaluated in specific evaluation modes and experimental designs, and shows that this approach provides a basis for determining comparability, and hence for comparing evaluations across papers, for meta-evaluation experiments, and for reproducibility testing.
The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
TLDR
The first shared task on reproducibility of human evaluations, ReproGen 2021, is organised and described in detail; results from each of the submitted reproduction studies are summarised, and a further comparative analysis of the results is provided.
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
TLDR
This work designs templates that each target a specific criterion and perturb the output so that its quality is affected only along that criterion, and shows that existing evaluation metrics are not robust against even simple perturbations and disagree with the scores assigned by humans to the perturbed outputs.
How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory
TLDR
This work identifies the implicit assumptions the standard human evaluation protocol makes about annotators and suggests improvements to make it more theoretically sound; even in its improved form, however, the protocol cannot be used to evaluate open-ended tasks like story generation.
Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many…
The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP
This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first…
ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
TLDR
This paper outlines ideas for a shared task on reproducibility of human evaluations in NLG, which aims to shed light on the extent to which past NLG evaluations are replicable and reproducible, and to draw conclusions about how evaluations can be designed and reported to increase replicability and reproducibility.
Towards Human-Free Automatic Quality Evaluation of German Summarization
TLDR
This work demonstrates how to adapt the BLANC metric to a language other than English and shows that, for German, BLANC is especially good at evaluating informativeness.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
TLDR
This paper surveys the issues with human and automatic model evaluations, and with commonly used datasets in NLG, that have been pointed out over the past 20 years; it lays out a long-term vision for NLG evaluation and proposes concrete steps for improving evaluation processes.
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
TLDR
The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT3-authored text are explored, finding that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.

References

Showing 1–10 of 20 references
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
TLDR
This work proposes a classification system for evaluations based on disentangling what is being evaluated and how it is evaluated in specific evaluation modes and experimental designs, and shows that this approach provides a basis for determining comparability, and hence for comparing evaluations across papers, for meta-evaluation experiments, and for reproducibility testing.
An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems
TLDR
Two studies of how well some metrics that are popular in other areas of NLP correlate with human judgments in the domain of computer-generated weather forecasts suggest that, at least in this domain, such metrics may provide a useful measure of language quality, although the evidence for this is not as strong as one would ideally like.
A Structured Review of the Validity of BLEU
TLDR
The evidence supports using BLEU for diagnostic evaluation of MT systems, but does not support using it outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
Evaluation methodologies in Automatic Question Generation 2013-2018
TLDR
This study suggests that, given the rapidly increasing level of research in the area, a common framework is urgently needed to compare the performance of AQG systems and NLG systems more generally.
Using the crowd for readability prediction
TLDR
It is concluded that readability assessment by comparing texts is a polyvalent methodology, which can be adapted to specific domains and target audiences if required.
Best practices for the human evaluation of automatically generated text
TLDR
This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature, for Natural Language Generation systems.
Evaluation of Text Generation: A Survey
TLDR
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
A Snapshot of NLG Evaluation Practices 2005 - 2014
TLDR
A snapshot of end-to-end NLG system evaluations as presented in conference and journal papers over the last ten years is provided to better understand the nature and type of evaluations that have been undertaken.
RankME: Reliable Human Ratings for Natural Language Generation
TLDR
This work presents a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments, and shows that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods.
Why We Need New Evaluation Metrics for NLG
TLDR
A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.