Finding a Balanced Degree of Automation for Summary Evaluation

Shiyue Zhang and Mohit Bansal
Human evaluation of summarization is reliable but raises issues of reproducibility and high cost. Automatic metrics are cheap and reproducible but sometimes correlate poorly with human judgment. In this work, we propose flexible semi-automatic to automatic summary evaluation metrics, following the Pyramid human evaluation method. The semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for reference(s) but replaces the manual work of judging SCUs…
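As a rough illustration of Pyramid-style content scoring (a minimal sketch, not the paper's Lite2Pyramid implementation), a summary can be scored as the weighted fraction of reference SCUs judged present in it; the presence judgment below is a hypothetical placeholder for the manual or model-based check:

```python
def pyramid_score(scus, summary, is_present):
    """Weighted fraction of Summary Content Units (SCUs) found in a summary.

    scus: list of (scu_text, weight) pairs; the weight reflects how many
          reference summaries contain the unit.
    is_present: callable(scu_text, summary) -> bool; this stands in for the
          manual SCU judgment that Lite2Pyramid replaces with a model.
    """
    total = sum(w for _, w in scus)
    if total == 0:
        return 0.0
    covered = sum(w for scu, w in scus if is_present(scu, summary))
    return covered / total

# Toy usage with a naive substring check standing in for the judgment step.
scus = [("the cat sat", 3), ("the dog barked", 1)]
summary = "A report notes that the cat sat quietly."
print(pyramid_score(scus, summary, lambda s, t: s in t))  # 0.75
```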


Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps to improve evaluation practices.
Repro: An Open-Source Library for Improving the Reproducibility and Usability of Publicly Available Research Code
We introduce Repro, an open-source library which aims at improving the reproducibility and usability of research code. The library provides a lightweight Python API for running software released by


Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
This work proposes QAGS (pronounced “kags”), an automatic evaluation protocol designed to identify factual inconsistencies in a generated summary, and argues that it is a promising tool for automatically generating usable and factually consistent text.
Formal and functional assessment of the pyramid method for summary content evaluation
  • R. Passonneau
  • Natural Language Engineering
  • 2009
A formal assessment of pyramid data from the 2003 Document Understanding Conference is presented; this addresses the method's ability to discriminate systems across years and indicates that the statistical power of the method is more than sufficient to identify statistically significant differences among systems.
PEAK: Pyramid Evaluation via Automated Knowledge Extraction
PEAK is proposed, the first method to automatically assess summary content using the pyramid method that also generates the pyramid content models, and relies on open information extraction and graph algorithms.
Learning to Score System Summaries for Better Content Selection Evaluation.
This work proposes to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets like TAC-2008 and TAC-2009, and releases the trained metric as an open-source tool.
Fill in the BLANC: Human-free quality estimation of document summaries
Evidence is presented that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements, and the method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.
SUM-QE: a BERT-based Summary Quality Estimation Model
The model addresses linguistic quality aspects that are only indirectly captured by content-based approaches to summary evaluation, without involving comparison with human references, and achieves very high correlations with human ratings.
Fact-based Content Weighting for Evaluating Abstractive Summarisation
A new evaluation metric is introduced which is based on fact-level content weighting, i.e. relating the facts of the document to the facts of the summary; it is highly correlated with human perception and compares favourably to the recent manual highlight-based metric.
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.
FEVER: a Large-scale Dataset for Fact Extraction and VERification
This paper introduces a new publicly available dataset for verification against textual sources, FEVER, which consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.
An Information-Theoretic Approach to Automatic Evaluation of Summaries
This paper introduces an information-theoretic approach to automatic evaluation of summaries based on the Jensen-Shannon divergence between the distributions of an automatic summary and a set of reference summaries; results indicate that the JS divergence-based evaluation method achieves performance comparable to the common automatic metric ROUGE on the single-document summarization task.
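The JS-divergence approach above can be sketched as follows: build unigram word distributions for the system summary and the reference(s), then score the summary by their Jensen-Shannon divergence (lower is better). This is a minimal illustration under simplifying assumptions (unsmoothed unigram counts), not the paper's exact implementation:

```python
import math
from collections import Counter

def _dist(tokens, vocab):
    """Unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def js_divergence(tokens_p, tokens_q):
    """Jensen-Shannon divergence (base 2) between two token lists.

    JS(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = (P+Q)/2.
    Ranges over [0, 1]; lower means the summary's word distribution
    is closer to the reference's.
    """
    vocab = sorted(set(tokens_p) | set(tokens_q))
    p = _dist(tokens_p, vocab)
    q = _dist(tokens_q, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical texts diverge by 0; texts with disjoint vocabularies by 1.
print(js_divergence("a b c".split(), "a b c".split()))  # 0.0
print(js_divergence("a a".split(), "b b".split()))      # 1.0
```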