Corpus ID: 55156862

Looking for a Few Good Metrics: ROUGE and its Evaluation

@inproceedings{Lin2004LookingFA,
  title={Looking for a Few Good Metrics: ROUGE and its Evaluation},
  author={Chin-Yew Lin},
  year={2004}
}
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper discusses the validity of the evaluation method used in the Document Understanding…
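As a rough illustration of the counting described above, the sketch below computes a recall-oriented n-gram overlap between a candidate summary and a set of human references. It is a minimal approximation only: the function names are mine, and the official ROUGE package additionally applies stemming, stopword handling, and jackknifing over multiple references.

from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """Recall-oriented n-gram overlap between a candidate summary and
    one or more human reference summaries (illustrative only)."""
    cand_ngrams = ngrams(candidate.split(), n)
    overlap, total = 0, 0
    for ref in references:
        ref_ngrams = ngrams(ref.split(), n)
        # Clipped count: each reference n-gram is credited at most as many
        # times as it appears in the candidate.
        overlap += sum(min(count, cand_ngrams[g]) for g, count in ref_ngrams.items())
        total += sum(ref_ngrams.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     ["the cat was on the mat"], n=1))  # 5 of 6 reference unigrams recalled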

Citations

ROUGE: A Package for Automatic Evaluation of Summaries
TLDR
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package, together with their evaluations.
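For the sequence-based variant named above, a minimal sketch of LCS-based ROUGE-L recall is given below; the helper names are mine, and the released package additionally reports precision, an F-measure, and the weighted-LCS variant ROUGE-W.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists
    (standard dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0

print(rouge_l_recall("the cat sat on the mat", "the cat was on the mat"))  # 5/6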
A Graph-theoretic Summary Evaluation for ROUGE
TLDR
Experimental results over TAC AESOP datasets show that exploiting the lexico-semantic similarity of the words used in summaries would significantly help ROUGE correlate better with human judgments.
Better Summarization Evaluation with Word Embeddings for ROUGE
TLDR
This proposal uses word embeddings to compute the semantic similarity of the words used in summaries, overcoming ROUGE's weakness in evaluating abstractive summaries and summaries with substantial paraphrasing.
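One way to picture the proposal is as a "soft" unigram recall in which exact string matching is replaced by cosine similarity between word vectors. The sketch below assumes vectors is any pretrained word-vector lookup (a dict of NumPy arrays); the embeddings and the aggregation actually used in the paper may differ.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def soft_unigram_recall(candidate, reference, vectors):
    """For each reference token, credit the best-matching candidate token
    by cosine similarity instead of requiring an exact string match."""
    cand = [t for t in candidate.split() if t in vectors]
    ref = [t for t in reference.split() if t in vectors]
    if not ref or not cand:
        return 0.0
    scores = [max(cosine(vectors[r], vectors[c]) for c in cand) for r in ref]
    return sum(scores) / len(ref)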
A Semantically Motivated Approach to Compute ROUGE Scores
TLDR
A graph-based algorithm is adopted into ROUGE to capture the semantic similarities between peer and model summaries; the results indicate that exploiting the lexico-semantic similarity of the words used in summaries would significantly help ROUGE correlate better with human judgments.
A Framework for Word Embedding Based Automatic Text Summarization and Evaluation
TLDR
This paper proposes a word embedding based automatic text summarization and evaluation framework, which can successfully determine salient top-n sentences of a source text as a reference summary, and evaluate the quality of system summaries against it.
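The framework is described here only at a high level; one plausible reading of "salient top-n sentences" is to rank sentences by the similarity of their embedding to the document centroid, as in the hypothetical sketch below, where embed stands in for an arbitrary sentence-embedding function rather than the authors' actual model.

import numpy as np

def top_n_pseudo_reference(sentences, embed, n=3):
    """Pick the n sentences closest (by cosine) to the mean document embedding."""
    vecs = np.array([embed(s) for s in sentences])
    centroid = vecs.mean(axis=0)
    sims = vecs @ centroid / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid) + 1e-12)
    ranked = np.argsort(-sims)[:n]
    return [sentences[i] for i in sorted(ranked)]  # keep original document order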
Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning
TLDR
A new BERT-based metric is designed that covers both linguistic quality and semantic informativeness; it outperforms other metrics even without reference summaries and is general and transferable across datasets.
Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding)
TLDR
The concept of interestingness is defined as a generalization of informativeness, whereby the information need is diverse and formalized as an unknown set of implicit queries; bigrams appear to be a key factor in interestingness evaluation.
A Novel Method for Summarization and Evaluation of Messages from Twitter
TLDR
A novel method of event summarization is proposed that uses frequently occurring word sets and features of event-related messages in a cluster; various summarizers are evaluated using standard ROUGE-based metrics.
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
TLDR
This work proposes SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques.
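A minimal sketch of the soft token alignment idea follows: each token embedding of the system summary is greedily matched to its most similar token embedding in the pseudo-reference and the similarities are averaged. SUPERT itself uses contextualized (SBERT-style) embeddings and more careful weighting; the embedding matrices here are assumed to be precomputed.

import numpy as np

def soft_alignment_score(summary_emb, pseudo_ref_emb):
    """summary_emb: (m, d) token embeddings of the system summary;
    pseudo_ref_emb: (n, d) token embeddings of the pseudo-reference."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = pseudo_ref_emb / np.linalg.norm(pseudo_ref_emb, axis=1, keepdims=True)
    sim = s @ r.T                        # pairwise cosine similarities
    recall = sim.max(axis=0).mean()      # best match for each reference token
    precision = sim.max(axis=1).mean()   # best match for each summary token
    return 2 * precision * recall / (precision + recall + 1e-12)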
Document summarization based on word associations
TLDR
The experiments indicate that the proposed method is the best-performing unsupervised summarization method among state-of-the-art approaches that make no use of human-curated knowledge bases.

References

Showing 1-10 of 12 references
ROUGE: A Package for Automatic Evaluation of Summaries
TLDR
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package, together with their evaluations.
Manual and automatic evaluation of summaries
TLDR
This paper shows the instability of the manual evaluation of summaries, and investigates the feasibility of automated summary evaluation based on the recent BLEU method from machine translation using accumulative n-gram overlap scores between system and human summaries.
Examining the consensus between human summaries: initial experiments with factoid analysis
We present a new approach to summary evaluation which combines two novel aspects, namely (a) content comparison between gold standard summary and system summary via factoids, a pseudo-semantic…
Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics
TLDR
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
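Meta-evaluation of this kind ultimately reduces to correlating per-system metric scores with per-system human scores; a hedged sketch using Pearson and Spearman correlation is shown below (the DUC-style studies also report confidence intervals, omitted here).

from scipy.stats import pearsonr, spearmanr

def metric_human_correlation(metric_scores, human_scores):
    """Both arguments are per-system score lists in the same order."""
    pearson, _ = pearsonr(metric_scores, human_scores)
    spearman, _ = spearmanr(metric_scores, human_scores)
    return pearson, spearman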
Evaluation method for automatic speech summarization
TLDR
A new metric for automatic summarization results, weighted summarization accuracy (WSumACCY), is proposed; it is weighted by the posterior probability of the manual summaries in the network, giving the reliability of each answer extracted from the network.
Evaluating Content Selection in Summarization: The Pyramid Method
TLDR
It is argued that the method presented is reliable, predictive and diagnostic, and thus addresses the shortcomings of the human evaluation method currently used in the Document Understanding Conference.
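The pyramid score itself has a simple form once Summary Content Units (SCUs) have been annotated by hand: the weight of the SCUs a peer summary expresses, divided by the weight an ideally informative summary with the same number of SCUs would achieve. A minimal sketch, with my own function and argument names, follows.

def pyramid_score(peer_scus, scu_weights):
    """peer_scus: set of SCU ids found in the peer summary;
    scu_weights: dict mapping every SCU id to its weight
    (number of model summaries expressing it)."""
    observed = sum(scu_weights[s] for s in peer_scus if s in scu_weights)
    # The optimal score uses the |peer_scus| highest-weighted SCUs.
    best = sorted(scu_weights.values(), reverse=True)[:len(peer_scus)]
    optimal = sum(best)
    return observed / optimal if optimal else 0.0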
Meta-evaluation of Summaries in a Cross-lingual Environment using Content-based Metrics
TLDR
A framework is described for the evaluation of summaries in English and Chinese using similarity measures, which can be used to evaluate extractive, non-extractive, single-document and multi-document summarization.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
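For comparison with ROUGE's recall orientation, a compact sketch of sentence-level BLEU (clipped n-gram precisions combined geometrically with a brevity penalty) is shown below; the official definition is corpus-level and involves smoothing choices not reproduced here.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clipped (modified) n-gram precision.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)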
Text Summarization Challenge 2 text summarization evaluation at NTCIR workshop 3
We report the outline of Text Summarization Challenge 2 (TSC2 hereafter), a sequel text summarization evaluation conducted as one of the tasks at the NTCIR Workshop 3. First, we describe briefly the…
Bootstrap Methods and their Application
TLDR
This book gives a broad and up-to-date coverage of bootstrap methods, with numerous applied examples, developed in a coherent way with the necessary theoretical basis, including improved Monte Carlo simulation.
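Bootstrap resampling is a standard way to attach confidence intervals to an average metric score (for example, per-document ROUGE scores); the sketch below implements the percentile interval for the mean, one common choice among several.

import random

def bootstrap_ci(scores, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]   # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi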