A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Daniel Deutsch, Rotem Dror, and Dan Roth. Transactions of the Association for Computational Linguistics.
The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, nor whether differences between two metrics' correlations reflect a true difference or are due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests…
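The resampling idea the abstract describes can be sketched as a percentile bootstrap over summaries: resample the evaluation set with replacement, recompute the metric–human correlation each time, and read a confidence interval off the empirical quantiles. The function and variable names below are illustrative, not the paper's own implementation.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def bootstrap_correlation_ci(metric_scores, human_scores,
                             n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the metric-human
    correlation, resampling summaries with replacement (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(metric_scores)
    corrs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample summaries
        xs = [metric_scores[i] for i in idx]
        ys = [human_scores[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:    # skip degenerate resamples
            corrs.append(pearson(xs, ys))
    corrs.sort()
    lo = corrs[int((alpha / 2) * len(corrs))]
    hi = corrs[min(int((1 - alpha / 2) * len(corrs)), len(corrs) - 1)]
    return lo, hi
```

The width of the resulting interval is what makes the paper's point: with typical annotation-set sizes, intervals around metric correlations can be wide enough that many reported differences between metrics are not distinguishable.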

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

This work identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and proposes changes to rectify this disconnect.

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

A modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement, is proposed; this has important implications for evaluating large language models (LLMs), as it shows that LLMs adjusted by human feedback may be over-rated by less constrained human evaluation.

Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics

This work benchmarks the lexical answer verification methods used by current QA-based metrics as well as two more sophisticated text comparison methods, BERTScore and LERC, and finds that LERC outperforms the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

This work proposes a metric to evaluate the content quality of a summary using question-answering (QA), and identifies its performance bottlenecks and estimates that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.

Does Summary Evaluation Survive Translation to Other Languages?

This work translates the English SummEval dataset to seven languages and explores equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries, finding some potential for dataset reuse in languages similar to the source and along particular dimensions of summary quality.
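Equivalence testing, as mentioned above, inverts the usual significance logic: instead of asking whether two correlations differ, it asks whether they are provably close. A common recipe is two one-sided tests (TOST) on Fisher-z-transformed correlations. The sketch below is a generic TOST, not the paper's exact procedure, and `delta_z` is an assumed equivalence margin.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tost_correlation_equivalence(r, r0, n, delta_z=0.2):
    """Two one-sided tests (TOST) on the Fisher-z scale: p-value for the
    claim that an observed correlation r (from n paired scores) is
    equivalent to a reference correlation r0 within +/- delta_z.
    Small p supports equivalence. Illustrative sketch."""
    d = math.atanh(r) - math.atanh(r0)   # difference on the Fisher-z scale
    se = 1.0 / math.sqrt(n - 3)          # standard error of a z-transformed r
    p_lower = 1.0 - norm_cdf((d + delta_z) / se)  # H0: d <= -delta_z
    p_upper = norm_cdf((d - delta_z) / se)        # H0: d >= +delta_z
    return max(p_lower, p_upper)
```

With a small sample or a large gap between correlations, the test correctly refuses to declare equivalence, which is exactly the caution needed before reusing a translated dataset.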

How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation

This work conducts a large-scale investigation of various methods for summary coherence modelling on a level playing field and introduces two novel analysis measures, _intra-system correlation_ and _bias matrices_, that help identify biases in coherence measures and provide robustness against system-level confounders.

LENS: A Learnable Evaluation Metric for Text Simplification

This work introduces RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process.

On the Limitations of Reference-Free Evaluations of Generated Text

It is demonstrated that reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and it is argued that they should not be used to measure progress on tasks like machine translation or summarization.

Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric

The experimental results support the initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context to resolve ambiguities in the reference.

On the State of German (Abstractive) Text Summarization

A comprehensive assessment of available models on the cleaned versions of datasets is provided, and it is found that data cleaning can lead to a reduction of more than 20 ROUGE-1 points during evaluation.

Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE

An analysis of current evaluation methodologies applied to summarization metrics reveals for the first time which metric variants significantly outperform others, identifies optimal metric variants distinct from the currently recommended best variants, and finds that the machine translation metric BLEU performs on par with ROUGE for the purpose of evaluating summarization systems.

A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

This work analyzes the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems, and shows that some of the neglected variants of ROUGE, based on higher-order n-grams and syntactic dependencies, are most accurate across the years.

SummEval: Re-evaluating Summarization Evaluation

This work re-evaluates 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations, and implements and shares a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics.

Summarization system evaluation revisited: N-gram graphs

A novel automatic method for evaluating summarization systems is proposed, based on comparing the character n-gram graph representations of the extracted summaries against a number of model summaries; its evaluation performance matches and even exceeds that of other contemporary evaluation methods.
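The core data structure here, the character n-gram graph, connects n-grams that co-occur within a small window and compares summaries by the overlap of their weighted edges. A minimal sketch of the idea follows; the window size, similarity definition, and function names are simplified illustrations, not the published algorithm.

```python
from collections import Counter

def char_ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph: nodes are n-grams, and weighted
    (undirected) edges connect n-grams that co-occur within a window
    of neighboring positions. Illustrative sketch."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[frozenset((g, grams[j]))] += 1
    return edges

def value_similarity(g1, g2):
    """Compare two n-gram graphs: shared edges contribute their
    min/max weight ratio, normalized by the larger graph's size."""
    if not g1 and not g2:
        return 1.0
    common = set(g1) & set(g2)
    num = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return num / max(len(g1), len(g2))
```

Because it works at the character level, the comparison is tokenizer-free and tolerant of morphological variation, which is part of what the n-gram graph family of methods exploits.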

Re-evaluating Evaluation in Text Summarization

Assessing the reliability of automatic metrics using top-scoring system outputs on recently popular datasets for both system-level and summary-level evaluation settings finds that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.

Summary Evaluation: Together We Stand NPowER-ed

NPowER, a machine-learning-based evaluation method that combines a set of methods from the family of "n-gram graph"-based summary evaluation methods, is proposed, and it is shown that the combined, optimized use of these evaluation methods outperforms the individual ones.

Testing for Significance of Increased Correlation with Human Judgment

A significance test for comparing correlations of two metrics, along with an open-source implementation of the test, are introduced, which shows that for a high proportion of metrics, there is insufficient evidence to conclude significant improvement over BLEU.
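A standard choice for this kind of comparison is the Williams test, which accounts for the fact that both metrics are correlated with the same human judgments (and with each other). A compact sketch follows; for simplicity it approximates the t(n-3) distribution with a standard normal, which is reasonable for the large n typical of metric evaluation, and is not tied to the paper's released implementation.

```python
import math

def williams_test(r1h, r2h, r12, n):
    """One-sided Williams test (formula as given by Steiger, 1980):
    is metric 1's correlation with human judgment (r1h) significantly
    greater than metric 2's (r2h), given the correlation r12 between
    the two metrics' scores over n items? Returns (t, p)."""
    k = 1 - r1h**2 - r2h**2 - r12**2 + 2 * r1h * r2h * r12
    rbar = (r1h + r2h) / 2
    t = (r1h - r2h) * math.sqrt(
        ((n - 1) * (1 + r12))
        / (2 * k * (n - 1) / (n - 3) + rbar**2 * (1 - r12) ** 3)
    )
    # One-sided p-value, normal approximation to t(n-3):
    p = 1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return t, p
```

Intuitively, the more correlated the two metrics are with each other (larger r12), the smaller a difference in human correlation needs to be to reach significance.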

An Empirical Investigation of Statistical Significance in NLP

Two aspects of the empirical behavior of paired significance tests for NLP systems are investigated: when one system appears to outperform another, and, once significance levels are computed, how well the standard i.i.d. notion of significance holds up in practical settings where future distributions are neither independent nor identically distributed.
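The workhorse in this line of work is the paired bootstrap: resample the evaluation set with replacement and count how often the apparent winner fails to win. The sketch below assumes per-item scores for two systems on the same evaluation set; the function name and p-value convention are illustrative.

```python
import random

def paired_bootstrap_test(scores_a, scores_b, n_boot=1000, seed=0):
    """Paired bootstrap significance test: resample the evaluation set
    with replacement and count how often system A fails to beat system B
    on mean score. Returns an approximate one-sided p-value for the
    claim "A is better than B". Illustrative sketch."""
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # same items for both systems
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:
            losses += 1
    return losses / n_boot
```

Because both systems are scored on the same resampled items, per-item difficulty cancels out, which is what makes the test "paired" and typically more powerful than an unpaired comparison.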

Results of the WMT20 Metrics Shared Task

An extensive analysis is presented of the influence of different reference translations on metric reliability, of how well automatic metrics score human translations, and of major discrepancies between metric and human scores when evaluating MT systems.