A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Daniel Deutsch, Rotem Dror, and Dan Roth. Transactions of the Association for Computational Linguistics.
Abstract: The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a genuine difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests…
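As a concrete illustration of the resampling idea, the sketch below computes a percentile-bootstrap confidence interval for a metric-human correlation. This is a minimal sketch under assumptions of my own: a simple percentile bootstrap over (metric, human) score pairs, and Pearson correlation; the paper's actual procedures may use different bootstrap designs and correlation coefficients, and all function names here are illustrative.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists of scores.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def bootstrap_ci(metric_scores, human_scores, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample score pairs with replacement,
    # recompute the correlation each time, and report the alpha/2 and
    # 1 - alpha/2 empirical quantiles as the confidence interval.
    rng = random.Random(seed)
    n = len(metric_scores)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson([metric_scores[i] for i in idx],
                          [human_scores[i] for i in idx]))
    rs.sort()
    lo = rs[int(n_boot * alpha / 2)]
    hi = rs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Resampling whole pairs preserves the pairing between metric and human scores; richer designs that also resample systems or input documents are possible but omitted here.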
Does Summary Evaluation Survive Translation to Other Languages?
This work translates the English dataset SummEval to seven languages and explores equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries.
Automatic Text Evaluation through the Lens of Wasserstein Barycenters
The results show that BaryScore outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization.
Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality
This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality and shows that combining a neural translation model with a neural reference-based metric, BLEURT, yields significant improvement in automatic and human evaluations.
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
This paper presents the results of the WMT21 Metrics Shared Task, in which participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics.
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
This work proposes a metric to evaluate the content quality of a summary using question-answering (QA), and identifies its performance bottlenecks and estimates that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.


Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE
An analysis of current evaluation methodologies applied to summarization metrics reveals, for the first time, which metric variants significantly outperform others; it identifies optimal metric variants distinct from the currently recommended best variants, and shows that the machine translation metric BLEU performs on par with ROUGE for evaluating summarization systems.
An Assessment of the Accuracy of Automatic Evaluation in Summarization
This work presents an assessment of the automatic evaluations used for multi-document summarization of news, along with recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems.
A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
This work analyzes the performance of eight ROUGE variants in terms of accuracy, precision, and recall in finding significantly different systems, and shows that some neglected variants of ROUGE, based on higher-order n-grams and syntactic dependencies, are the most accurate across the years.
Summarization system evaluation revisited: N-gram graphs
A novel automatic method for the evaluation of summarization systems, based on comparing the character n-gram graph representations of extracted summaries against a number of model summaries; its evaluation performance matches and even exceeds that of other contemporary evaluation methods.
Re-evaluating Evaluation in Text Summarization
Assessing the reliability of automatic metrics using top-scoring system outputs on recently popular datasets, in both system-level and summary-level evaluation settings, this work finds that conclusions about evaluation metrics drawn on older datasets do not necessarily hold for modern datasets and systems.
Summary Evaluation: Together We Stand NPowER-ed
This work proposes NPowER, an evaluation method based on machine learning that combines a set of methods from the family of "n-gram graph"-based summary evaluation methods, and shows that the combined, optimized use of these methods outperforms each individual one.
Learning to Score System Summaries for Better Content Selection Evaluation.
This work proposes to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets such as TAC-2008 and TAC-2009, and releases the trained metric as an open-source tool.
Testing for Significance of Increased Correlation with Human Judgment
A significance test for comparing the correlations of two metrics with human judgment, along with an open-source implementation of the test, is introduced; it shows that for a high proportion of metrics there is insufficient evidence to conclude a significant improvement over BLEU.
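Comparisons of this kind can be illustrated with a generic paired permutation test: under the null hypothesis that two metrics are equally correlated with human judgments, the A/B labels are exchangeable within each item, so we can randomly swap the two metrics' scores item by item and recount how often the permuted correlation difference reaches the observed one. This is a sketch of one possible test, not the specific test introduced in the cited work, and all names are hypothetical.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists of scores.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def corr_diff_perm_test(scores_a, scores_b, human, n_perm=2000, seed=0):
    # H0: metric A and metric B correlate equally with human judgments.
    # Under H0 the A/B labels are exchangeable per item, so flip them
    # with probability 1/2 and recompute the correlation difference.
    rng = random.Random(seed)
    observed = abs(pearson(scores_a, human) - pearson(scores_b, human))
    hits = 0
    for _ in range(n_perm):
        a, b = [], []
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                sa, sb = sb, sa
            a.append(sa)
            b.append(sb)
        if abs(pearson(a, human) - pearson(b, human)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

A small p-value indicates the observed gap between the two metrics' correlations is unlikely under label exchangeability; identical metrics give a p-value of 1 by construction.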
An Empirical Investigation of Statistical Significance in NLP
This work investigates two aspects of the empirical behavior of paired significance tests for NLP systems: what it means when one system appears to outperform another, and, once significance levels are computed, how well the standard i.i.d. notion of significance holds up in practical settings where future distributions are neither independent nor identically distributed.
SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics
The design of SacreROUGE is described, including the core Metric interface, the command-line API for evaluating summarization models and metrics, and the scripts to load and reformat publicly available datasets.