How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

Julius Steen and Katja Markert
Manual evaluation is essential to judge progress on automatic text summarization. However, we conduct a survey on recent summarization system papers that reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries’ linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations and show that the best choice of evaluation method can vary from one aspect to another. In our survey, we also find…


Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries
This work crowdsources evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert scale and ranking-based Best-Worst Scaling, and presents a scoring algorithm for Best-Worst Scaling called value learning.
SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization
SummVis, an open-source tool for visualizing abstractive summaries that enables fine-grained analysis of the models, data, and evaluation metrics associated with text summarization, is introduced.
A Text Mining using Web Scraping for Meaningful Insights
The aim of this tool is to efficiently extract a concise and coherent version of a long text or input document, keeping only the main points and avoiding repetition of information already mentioned earlier in the text.


An Assessment of the Accuracy of Automatic Evaluation in Summarization
An assessment of the automatic evaluations used for multi-document summarization of news, and recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems.
Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE
An analysis of current evaluation methodologies applied to summarization metrics reveals for the first time which metric variants significantly outperform others, identifies optimal metric variants distinct from the currently recommended best variants, and finds that the machine translation metric BLEU performs on par with ROUGE for the purpose of evaluating summarization systems.
Studying Summarization Evaluation Metrics in the Appropriate Scoring Range
It is shown that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate.
SummEval: Re-evaluating Summarization Evaluation
This work re-evaluates 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations, and implements and shares a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics.
Towards Best Experiment Design for Evaluating Dialogue System Output
Through a systematic study with 40 crowdsourced workers in each task, it is found that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs, and that factors such as the time taken to complete the task and having no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters.
HighRES: Highlight-based Reference-less Evaluation of Summarization
A novel approach for manual evaluation, Highlight-based Reference-less Evaluation of Summarization (HighRES), in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter, which improves inter-annotator agreement in comparison to using the source documents directly.
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
The experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization
The experimental results show that the max value over each dimension of the summary ELMo word embeddings is a good representation that results in high correlation with human ratings, and averaging the cosine similarity of all encoders the authors tested yields high correlation with manual scores in a reference-free setting.
Estimating Summary Quality with Pairwise Preferences
This paper proposes an alternative evaluation approach based on pairwise preferences of sentences that performs better than the three most popular versions of ROUGE with less expensive human input and can reuse existing evaluation data and achieve even better results.
Rethinking the Agreement in Human Evaluation Tasks
This paper examines how annotators diverge in language annotation tasks due to a range of ineliminable factors and suggests a new approach to the use of the agreement metrics in natural language generation evaluation tasks.