Corpus ID: 9008917

A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?

@inproceedings{Dorr2005AMF,
  title={A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?},
  author={B. Dorr and Christof Monz and Stacy President and Richard M. Schwartz and David M. Zajic},
  booktitle={IEEvaluation@ACL},
  year={2005}
}
This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment based on a new method for measuring agreement, Relevance-Prediction, which compares subjects’ judgments on summaries with their own judgments on full text documents. We demonstrate that, because this measure is more reliable than previous gold-standard measures, we are able to make stronger statistical statements about the benefits of summarization. We found positive correlations between ROUGE scores… 
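As a rough illustration of the Relevance-Prediction idea described above, the sketch below computes agreement between a subject's summary-based relevance judgment and the same subject's own full-text judgment, reported as a simple proportion. The data layout, function name, and scoring scheme are assumptions for illustration only, not the paper's exact protocol.

```python
# Minimal sketch of the Relevance-Prediction idea: for each (subject, document) pair,
# compare the relevance judgment made from the summary with the same subject's judgment
# made from the full document, and report the proportion of agreements.
# Data layout and names are illustrative assumptions, not the paper's implementation.

from typing import Dict, Tuple

Judgments = Dict[Tuple[str, str], bool]  # (subject_id, doc_id) -> relevance judgment

def relevance_prediction(summary_judgments: Judgments,
                         fulltext_judgments: Judgments) -> float:
    """Fraction of (subject, document) pairs where the summary-based judgment
    matches that subject's own full-text judgment."""
    shared = summary_judgments.keys() & fulltext_judgments.keys()
    if not shared:
        raise ValueError("no overlapping (subject, document) pairs")
    agreements = sum(summary_judgments[k] == fulltext_judgments[k] for k in shared)
    return agreements / len(shared)

if __name__ == "__main__":
    summary = {("s1", "d1"): True, ("s1", "d2"): False, ("s2", "d1"): True}
    fulltext = {("s1", "d1"): True, ("s1", "d2"): True, ("s2", "d1"): True}
    print(f"Relevance-Prediction agreement: {relevance_prediction(summary, fulltext):.2f}")
```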
Citations

Text Summarization Evaluation: Correlating Human Performance on an Extrinsic Task with Automatic Intrinsic Metrics
TLDR
Preliminary experimental results suggest that the Relevance Prediction method yields better performance measurements with human summaries than the LDC-Agreement method, and that small correlations are observed between one of the automatic intrinsic evaluation metrics and human task-based performance results.
Finding Good Enough: A Task-Based Evaluation of Query Biased Summarization for Cross-Language Information Retrieval
TLDR
This paper presents a task-based evaluation of query-biased summarization for cross-language information retrieval (CLIR) using relevance prediction, and finds that query-biased word clouds are the best summarization strategy overall.
What Makes a Good Summary? Reconsidering the Focus of Automatic Summarization
TLDR
A survey amongst heavy users of pre-made summaries finds that the current focus of the field does not fully align with participants' wishes, and proposes a methodology to evaluate the usefulness of a summary.
The elements of automatic summarization
TLDR
This thesis is about automatic summarization, with experimental results on multi-document news topics: how to choose a series of sentences that best represents a collection of articles about one topic, using an objective function for summarization that is called "maximum coverage".
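The entry above mentions a "maximum coverage" objective for choosing sentences. As a loose illustration only (the thesis's exact objective and solver are not given here), the following greedy sketch selects sentences that cover the most not-yet-covered word types under a fixed sentence budget; all names and the choice of word types as content units are hypothetical.

```python
# Illustrative greedy selection for a maximum-coverage style objective: repeatedly pick
# the sentence that adds the most not-yet-covered content units (here, simply word types)
# until a sentence budget is reached. A generic sketch, not the thesis's exact model.

def greedy_max_coverage(sentences: list[str], budget: int = 3) -> list[str]:
    covered: set[str] = set()
    summary: list[str] = []
    candidates = {s: set(s.lower().split()) for s in sentences}
    for _ in range(min(budget, len(candidates))):
        # Choose the sentence adding the most new units; stop if nothing new is gained.
        best = max(candidates, key=lambda s: len(candidates[s] - covered))
        gain = candidates[best] - covered
        if not gain:
            break
        summary.append(best)
        covered |= gain
        del candidates[best]
    return summary
```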
Extrinsic summarization evaluation: A decision audit task
TLDR
It is found that while ASR errors affect user satisfaction on an information retrieval task, users can adapt their browsing behavior to complete the task satisfactorily and consider extractive summaries to be intuitive and useful tools for browsing multimodal meeting data.
Bayesian Summarization at DUC and a Suggestion for Extrinsic Evaluation
We describe our entry into the Document Understanding Conference competition for evaluating query-focused multi-document summarization systems. Our system is based on a Bayesian query-focused…
What Makes a Good Summary?
Automatic text summarization has enjoyed great progress over the last years. However, there is little research that investigates whether the current research focus adheres to users’ needs.
Distributional Semantics for Robust Automatic Summarization
TLDR
It is argued that current automatic summarization systems avoid relying on semantic analysis by focusing instead on replicating the source text to be summarized, but that substantial progress will not be possible without semantic analysis and domain knowledge acquisition.
Text Content and Task Performance in the Evaluation of a Natural Language Generation System
TLDR
The outcomes of a task-based evaluation of a system that generates summaries of patient data are investigated, and an attempt is made to correlate them with the results of an analysis of the system's texts against a set of gold-standard human-authored summaries.
Extrinsic Summarization Evaluation: A Decision Audit Task
TLDR
This work describes a large-scale extrinsic evaluation of automatic speech summarization technologies for meeting speech, wherein a user must satisfy a complex information need, navigating several meetings in order to gain an understanding of how and why a given decision was made.
...

References

Showing 1-10 of 23 references
Extrinsic Evaluation of Automatic Metrics for Summarization
TLDR
It is shown that it is possible to save time using summaries for relevance assessment without adversely impacting the degree of accuracy that would be possible with full documents, and a small yet statistically significant correlation between some of the intrinsic measures and a user's performance in an extrinsic task is found.
SUMMAC: a text summarization evaluation
TLDR
Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high, and that the evaluation methods used in the SUMMAC evaluation are of interest both for summarization evaluation and for the evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs.
Summarization Evaluation Methods: Experiments and Analysis
TLDR
The results show that different parameters of an experiment can affect how well a system scores, and describe how parameters can be controlled to produce a sound evaluation.
ROUGE: A Package for Automatic Evaluation of Summaries
TLDR
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, all included in the ROUGE summarization evaluation package, and their evaluations are presented.
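For readers unfamiliar with this family of measures, the sketch below computes a simplified single-reference ROUGE-N recall (clipped n-gram overlap divided by the number of reference n-grams). It is a minimal illustration of the core formula, not a substitute for the ROUGE package.

```python
# Simplified single-reference ROUGE-N recall: the fraction of reference n-grams that also
# appear in the candidate summary, with counts clipped so a candidate n-gram cannot be
# credited more times than it occurs. Illustrative sketch of the core formula only.

from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```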
Headline Evaluation Experiment Results
This technical report describes an experiment intended to show that different summarization techniques have an effect on human performance on an extrinsic task. The task is document selection in the…
Examining the consensus between human summaries: initial experiments with factoid analysis
We present a new approach to summary evaluation which combines two novel aspects, namely (a) content comparison between gold-standard summary and system summary via factoids, a pseudo-semantic…
Evaluation of Phrase-Representation Summarization based on Information Retrieval Task
We have developed an improved task-based evaluation method of summarization, the accuracy of which is increased by specifying the details of the task including background stories, and by assigning…
Evaluating Content Selection in Summarization: The Pyramid Method
TLDR
It is argued that the method presented is reliable, predictive and diagnostic, thus addressing the shortcomings of the human evaluation method currently used in the Document Understanding Conference.
ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation
TLDR
A new evaluation method, Orange, is introduced for evaluating automatic machine translation evaluation metrics automatically without extra human involvement other than using a set of reference translations.
Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics
TLDR
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
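The kind of correlation analysis these papers refer to can be illustrated with a short sketch: score each system with an automatic metric and with human judgments, then compute a rank correlation across systems. The sketch assumes scipy is available; the score lists are made-up placeholders, not results from any of the papers above.

```python
# Sketch of the meta-evaluation behind "does the metric correlate?": compare per-system
# automatic-metric scores with per-system human scores via Spearman rank correlation.
# The numbers below are hypothetical placeholders.

from scipy.stats import spearmanr

metric_scores = [0.42, 0.38, 0.51, 0.33]   # e.g. ROUGE-1 per system (hypothetical)
human_scores  = [3.1,  2.8,  3.6,  2.5]    # e.g. mean human ratings per system (hypothetical)

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```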
...