Corpus ID: 227216991

ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG

Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter
Across NLP, a growing body of work is looking at the issue of reproducibility. However, replicability of human evaluation experiments and reproducibility of their results are currently under-addressed, and this is of particular concern for NLG, where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations are replicable and reproducible, and (ii) to draw …
The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices …
Quantifying Reproducibility in NLP and ML
Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged. …
Reproducing a Comparison of Hedged and Non-hedged NLG Texts
This paper describes an attempt to reproduce an earlier experiment, previously conducted by the author, that compares hedged and non-hedged NLG texts as part of the ReproGen shared challenge. …
Another PASS: A Reproduction Study of the Human Evaluation of a Football Report Generation System
This paper reports results from a reproduction study in which we repeated the human evaluation of the PASS Dutch-language football report generation system (van der Lee et al., 2017). …
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
Due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020; the field is in urgent need of standard methods and terminology.
Community Perspective on Replicability in Natural Language Processing
A survey is used to investigate how the NLP community perceives the topic of replicability in general, and confirms earlier observations that successful reproducibility requires more than having access to code and data.
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
This work proposes a classification system for evaluations based on disentangling what is being evaluated, and how it is evaluated in specific evaluation modes and experimental designs, and shows that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, and reproducibility testing.
Agreement is overrated: A plea for correlation to assess human evaluation reliability
Given human language variability, it is proposed that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability.
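The distinction the paper above draws can be illustrated with a minimal sketch (not taken from the paper itself; the data and function names are hypothetical): two raters whose scores are perfectly rank-correlated but never identical score high on correlation and zero on exact agreement, so the two coefficients measure different aspects of reliability.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def exact_agreement(xs, ys):
    """Proportion of items on which the two raters give identical scores."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

# Rater B consistently scores one point higher than rater A:
# ranking is preserved (correlation ~1.0) but exact agreement is 0.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 6]
print(pearson_r(rater_a, rater_b))        # close to 1.0
print(exact_agreement(rater_a, rater_b))  # 0.0
```

Reporting only the agreement coefficient here would suggest the evaluation is unreliable, while reporting only the correlation would hide the systematic offset between raters; using both, as the paper argues, exposes each effect.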
A Structured Review of the Validity of BLEU
  • Ehud Reiter
  • Computer Science
  • Computational Linguistics
  • 2018
The evidence supports using BLEU for diagnostic evaluation of MT systems, but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
Why We Need New Evaluation Metrics for NLG
A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.
An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems
The results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as the authors would ideally like to see; however, they do not provide a helpful measure of content quality.
ICLR Reproducibility Challenge 2019
DOI: 10.5281/zenodo.3158244
Welcome to this special issue of the ReScience C journal, which presents results of the 2019 ICLR Reproducibility Challenge (2nd edition). …
Best practices for the human evaluation of automatically generated text
This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature, for Natural Language Generation systems.
The reproducibility “crisis”
  • P. Hunter
  • Psychology, Medicine
  • EMBO reports
  • 2017
Evidence from larger meta-analyses of past papers also points to a lack of reproducibility in biomedical research, with potentially dire consequences for drug development and investment into research.