Corpus ID: 964287

ROUGE: A Package for Automatic Evaluation of Summaries

@inproceedings{Lin2004ROUGEAP,
  title={ROUGE: A Package for Automatic Evaluation of Summaries},
  author={Chin-Yew Lin},
  booktitle={ACL 2004},
  year={2004}
}
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package.
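As a concrete illustration of the overlap counting described above, here is a minimal Python sketch of the recall-oriented n-gram matching behind ROUGE-N (function names are illustrative, not the official package's API):

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams in a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    # ROUGE-N is recall-oriented: clipped n-gram matches divided by
    # the total number of n-grams in the human reference summary.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
print(rouge_n_recall(candidate, reference, n=2))  # 4 of 5 reference bigrams match -> 0.8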
Citations

Looking for a Few Good Metrics: ROUGE and its Evaluation
The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC.
Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?
The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU.
ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks
ROUGE 2.0 is introduced, with several updated measures: ROUGE-N+Synonyms, ROUGE-Topic, ROUGE-Topic+Synonyms, ROUGE-TopicUniq, and ROUGE-TopicUniq+Synonyms, all of which are improvements over the core ROUGE measures.
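To make the synonym idea concrete, a hedged sketch of synonym-aware unigram recall is given below. This is not the ROUGE 2.0 implementation; it assumes NLTK with the WordNet corpus installed, and the function names are illustrative:

from nltk.corpus import wordnet as wn

def synonym_set(word):
    # The word itself plus every WordNet lemma that shares a synset with it.
    lemmas = {word}
    for synset in wn.synsets(word):
        lemmas.update(lemma.name().lower() for lemma in synset.lemmas())
    return lemmas

def unigram_recall_with_synonyms(candidate, reference):
    # A reference token counts as matched if the candidate contains the
    # token or any of its WordNet synonyms (one-to-one alignment is not
    # enforced here, for simplicity).
    cand = set(candidate)
    matched = sum(1 for tok in reference if synonym_set(tok) & cand)
    return matched / len(reference) if reference else 0.0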
ROUGE-C: A fully automated evaluation method for multi-document summarization
ROUGE-C applies the ROUGE method with the reference summaries replaced by the source document and any query-focused information, enabling a fully automatic, reference-free way of evaluating multi-document summarization.
The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization
Experimental results show that the max value over each dimension of the summary's ELMo word embeddings is a good representation that correlates highly with human ratings, and that averaging the cosine similarity of all encoders the authors tested yields high correlation with manual scores in the reference-free setting.
Approximate unsupervised summary optimisation for selections of ROUGE
It is shown that it is possible to optimise approximately for ROUGE-N by using a document-weighted ROUGE objective, which results in state-of-the-art summariser performance on single- and multi-document summaries for both English and French.
A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?
This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment, based on a new method for measuring agreement, Relevance-Prediction, which compares subjects' judgments …
A Graph-theoretic Summary Evaluation for ROUGE
Experimental results over TAC AESOP datasets show that exploiting the lexico-semantic similarity of the words used in summaries significantly helps ROUGE correlate better with human judgments.
Better Summarization Evaluation with Word Embeddings for ROUGE
This proposal uses word embeddings to compute the semantic similarity of the words used in summaries, overcoming ROUGE's disadvantage in evaluating abstractive summarization and summaries with substantial paraphrasing.
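One plausible reading of this approach, sketched below, replaces exact unigram matching with cosine similarity over pretrained word vectors. The embed table (token to vector) and the similarity threshold are assumptions for illustration, not the paper's exact formulation:

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def soft_unigram_recall(candidate, reference, embed, threshold=0.8):
    # A reference token is covered if some candidate token's embedding
    # is similar enough to it; recall is the fraction of covered tokens.
    covered = sum(
        1 for ref_tok in reference
        if any(cosine(embed[ref_tok], embed[cand_tok]) >= threshold
               for cand_tok in candidate)
    )
    return covered / len(reference) if reference else 0.0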
The ROUGE-AR : A Proposed Extension to the ROUGE Evaluation Metric for Abstractive Text Summarization
Abstractive text summarization refers to summary generation that is based on semantic understanding, and is thus not strictly limited to the words found in the source. Despite its success in deep …

References

Showing 1-10 of 16 references
Looking for a Few Good Metrics: ROUGE and its Evaluation
The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC.
Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations across various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics
Two new objective automatic evaluation methods for machine translation are described: one based on the longest common subsequence between a candidate translation and a set of reference translations, and one that relaxes strict n-gram matching to skip-bigram matching.
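ROUGE-L and ROUGE-S grew out of these two ideas. A minimal dynamic-programming sketch of the longest-common-subsequence length that the first method relies on (illustrative, not the paper's code):

def lcs_length(x, y):
    # Classic LCS dynamic program: table[i][j] holds the LCS length
    # of the prefixes x[:i] and y[:j].
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

The LCS length is then normalised by the reference length for recall and by the candidate length for precision, and the two are combined into an F-measure.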
Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons
This paper shows how to induce an N-best translation lexicon from a bilingual text corpus using statistical properties of the corpus together with four external knowledge sources, which improve lexicon quality by up to 137% over the plain-vanilla statistical method and approach human performance.
Automatic Summarization
Experimental results show that the proposed automatic speech summarization technique for English effectively extracts relatively important information and removes redundant and irrelevant information from English news speech.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Meta-evaluation of Summaries in a Cross-lingual Environment using Content-based Metrics
A framework is described for the evaluation of summaries in English and Chinese using similarity measures, applicable to extractive and non-extractive, single- and multi-document summarization.
Precision and Recall of Machine Translation
Machine translation can be evaluated using precision, recall, and the F-measure, which have significantly higher correlation with human judgments than recently proposed alternatives.
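For reference, the F-measure combines precision P and recall R harmonically, F = 2PR / (P + R); the recall-weighted variant F_beta = (1 + beta^2)PR / (beta^2 P + R) is the form used in ROUGE's own F-scores, where a large beta weights the score toward recall.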
Bootstrap Methods and Their Application
Certain selection criteria developed for traditional regression and time-series models, when naively applied to certain nonnormal settings, appear to perform at least as well as selection criteria specifically designed for those settings.
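Bootstrap resampling of this kind is how the ROUGE package reports confidence intervals around its scores; a minimal percentile-bootstrap sketch over per-document scores (names illustrative):

import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the per-document scores with
    # replacement and take quantiles of the resampled means.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

print(bootstrap_ci([0.41, 0.38, 0.45, 0.40, 0.44, 0.39]))  # e.g. a 95% interval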
Text Summarization Branches Out: Proceedings of the ACL-04 Workshop
  • 2004