• Corpus ID: 964287

ROUGE: A Package for Automatic Evaluation of Summaries

@inproceedings{Lin2004ROUGEAP,
  title={ROUGE: A Package for Automatic Evaluation of Summaries},
  author={Chin-Yew Lin},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2004}
}
  • Chin-Yew Lin
  • Published in Annual Meeting of the Association for Computational Linguistics, 25 July 2004
  • Computer Science
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package.
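As a rough illustration of the n-gram overlap counting described above, here is a minimal Python sketch of a recall-oriented ROUGE-N-style score. The function names are illustrative, and the references are pooled for simplicity; the actual ROUGE package applies jackknifing over multiple references and offers many options not shown here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """Recall-oriented n-gram overlap: clipped matches divided by the total
    number of n-grams in the reference summaries (pooled here for simplicity)."""
    cand_counts = ngrams(candidate.lower().split(), n)
    total_match = total_ref = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total_match += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total_ref += sum(ref_counts.values())
    return total_match / total_ref if total_ref else 0.0

# Unigram overlap of a system summary against two human (ideal) summaries.
print(rouge_n_recall("the cat sat on the mat",
                     ["the cat was on the mat", "a cat sat on a mat"]))
```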


Looking for a Few Good Metrics: ROUGE and its Evaluation

The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five different ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC.

Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?

The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five different ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU.

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

ROUGE 2.0 is introduced, which has several updated measures of ROUGE: ROUGE-N+Synonyms, ROUGE-Topic, ROUGE-Topic+Synonyms, ROUGE-TopicUniq, and ROUGE-TopicUniq+Synonyms, all of which are improvements over the core ROUGE measures.

ROUGE-C: A fully automated evaluation method for multi-document summarization

ROUGE-C applies the ROUGE method by replacing the reference summaries with the source document as well as the query-focused information (if any), and therefore enables a fully manual-independent way of evaluating multi-document summarization.

Approximate unsupervised summary optimisation for selections of ROUGE

It is shown that it is possible to optimise approximately for ROUGE-N by using a document-weighted ROUGE objective, which results in state-of-the-art summariser performance for single- and multiple-document summaries in both English and French.
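To give a flavour of the general idea, here is a heavily simplified sketch of greedily optimising a ROUGE-like unigram objective computed against the source documents themselves rather than human references. The helper names, the unigram objective, and the greedy loop are assumptions for illustration, not the cited paper's actual algorithm or weighting scheme.

```python
from collections import Counter

def unigrams(text):
    return Counter(text.lower().split())

def rouge_like_score(summary_counts, target_counts):
    """Recall-style unigram overlap of a summary against a target distribution."""
    match = sum(min(c, summary_counts[w]) for w, c in target_counts.items())
    total = sum(target_counts.values())
    return match / total if total else 0.0

def greedy_summary(sentences, documents, budget=2):
    """Greedily add the sentence that most improves overlap with the
    source documents (no human reference summaries involved)."""
    target = Counter()
    for doc in documents:
        target += unigrams(doc)
    chosen, chosen_counts = [], Counter()
    for _ in range(budget):
        current = rouge_like_score(chosen_counts, target)
        best, best_gain = None, 0.0
        for s in sentences:
            if s in chosen:
                continue
            gain = rouge_like_score(chosen_counts + unigrams(s), target) - current
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break
        chosen.append(best)
        chosen_counts += unigrams(best)
    return chosen

docs = ["the storm closed the airport", "flights were cancelled after the storm"]
sents = ["the storm closed the airport", "flights were cancelled", "tickets are cheap"]
print(greedy_summary(sents, docs))
```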

A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?

This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment based on a new method for measuring agreement, Relevance-Prediction, which compares subjects' judgments on summaries with their own judgments on the corresponding full documents.

A Graph-theoretic Summary Evaluation for ROUGE

Experimental results over TAC AESOP datasets show that exploiting the lexico-semantic similarity of the words used in summaries would significantly help ROUGE correlate better with human judgments.

Better Summarization Evaluation with Word Embeddings for ROUGE

This proposal uses word embeddings to compute the semantic similarity of the words used in summaries, overcoming ROUGE's weakness in evaluating abstractive summaries and summaries with substantial paraphrasing.
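To make the idea concrete, here is a minimal sketch of replacing exact lexical matching with embedding-based similarity. The toy vectors and the scoring function are illustrative assumptions, not the cited paper's exact formulation or embedding model.

```python
import numpy as np

# Toy vectors standing in for pretrained word embeddings (e.g. word2vec);
# both the vectors and the scoring below are illustrative only.
EMB = {
    "car":        np.array([0.90, 0.10, 0.05]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "red":        np.array([0.10, 0.90, 0.20]),
    "fast":       np.array([0.20, 0.30, 0.90]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def soft_unigram_recall(candidate, reference):
    """Credit each reference word with its best cosine similarity to any
    candidate word, instead of requiring an exact lexical match."""
    cand = [w for w in candidate.lower().split() if w in EMB]
    ref = [w for w in reference.lower().split() if w in EMB]
    if not cand or not ref:
        return 0.0
    return sum(max(cosine(EMB[r], EMB[c]) for c in cand) for r in ref) / len(ref)

# "automobile" earns partial credit against "car" although the strings differ.
print(soft_unigram_recall("a fast automobile", "the fast red car"))
```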

The ROUGE-AR: A Proposed Extension to the ROUGE Evaluation Metric for Abstractive Text Summarization

The ROUGE-AR metric reweights the final ROUGE output by incorporating both anaphor resolution and other intrinsic methods that are largely absent from non-human text summary evaluation.

An Extensive Empirical Study of Automated Evaluation of Multi-Document Summarization

An approach to automated evaluation of multi-document summarization is discussed: the similarity between automated summaries and human summaries is computed, and each automated summary is scored by its similarity to the human ones.
...

References

Showing 1-10 of 13 references

Looking for a Few Good Metrics: ROUGE and its Evaluation

The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five different ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC.

Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics

The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.

Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics

Two new objective automatic evaluation methods for machine translation are described: one based on the longest common subsequence between a candidate translation and a set of reference translations, and one that relaxes strict n-gram matching to skip-bigram matching.
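A minimal sketch of the two matching ideas named above, longest common subsequence and skip-bigrams, follows. The recall formulas and the set-based skip-bigram counting are simplifications of what the paper and the ROUGE package actually compute.

```python
def lcs_length(x, y):
    """Longest common subsequence length via classic dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def skip_bigrams(tokens):
    """All in-order word pairs with arbitrary gaps (set-based simplification)."""
    return {(tokens[i], tokens[j])
            for i in range(len(tokens)) for j in range(i + 1, len(tokens))}

cand = "police killed the gunman".split()
ref = "police kill the gunman".split()
print(lcs_length(cand, ref) / len(ref))                                      # LCS-based recall
print(len(skip_bigrams(cand) & skip_bigrams(ref)) / len(skip_bigrams(ref)))  # skip-bigram recall
```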

Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons

This paper shows how to induce an N-best translation lexicon from a bilingual text corpus using statistical properties of the corpus together with four external knowledge sources, which improve lexicon quality by up to 137% over the plain vanilla statistical method, and approach human performance.

Automatic Summarization

Experimental results show that the proposed automatic speech summarization technique for English effectively extracts relatively important information and removes redundant and irrelevant information from English news speech.

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

Meta-evaluation of Summaries in a Cross-lingual Environment using Content-based Metrics

A framework for the evaluation of summaries in English and Chinese is described, using similarity measures that can be applied to extractive, non-extractive, single-document, and multi-document summarization.

Precision and Recall of Machine Translation

Machine translation can be evaluated using precision, recall, and the F-measure, which have significantly higher correlation with human judgments than recently proposed alternatives.
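For reference, here is a minimal sketch of unigram precision, recall, and F-measure of a candidate translation against a single reference. The clipped unigram matching is an assumption for illustration rather than the cited paper's exact matching procedure.

```python
from collections import Counter

def precision_recall_f(candidate, reference, beta=1.0):
    """Unigram precision, recall, and F-measure against one reference
    (clipped unigram counting; a simplification)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    match = sum(min(c, ref[w]) for w, c in cand.items())
    p = match / sum(cand.values()) if cand else 0.0
    r = match / sum(ref.values()) if ref else 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if (p + r) else 0.0
    return p, r, f

print(precision_recall_f("the gunman was shot by police", "police killed the gunman"))
```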

Bootstrap Methods and Their Application

Certain selection criteria developed for traditional regression and time series models, when naively applied to certain nonnormal settings, appear to perform at least as well as selection criteria specifically designed for those settings.

Workshop on Text Summarization Branches Out

  • 2004