Corpus ID: 3714849

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Kavita A. Ganesan
Evaluation of summarization tasks is crucial for determining the quality of machine-generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human-composed summaries, the existing ROUGE measures have several limitations in capturing synonymous concepts and coverage of topics. Thus, oftentimes ROUGE… 
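The n-gram overlap at the core of ROUGE can be sketched in a few lines. The snippet below is a minimal, illustrative ROUGE-N recall computation (not the official package), and its second example shows the synonymy limitation the abstract mentions: a paraphrase that swaps in synonyms gets no credit for them.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, system, n=1):
    """ROUGE-N recall: fraction of reference n-grams matched by the system summary."""
    ref = ngrams(reference.lower().split(), n)
    sys_ = ngrams(system.lower().split(), n)
    overlap = sum((ref & sys_).values())  # clipped n-gram matches
    total = sum(ref.values())
    return overlap / total if total else 0.0

# An exact lexical match scores 1.0 ...
print(rouge_n_recall("the cat sat", "the cat sat"))               # → 1.0
# ... but a synonymous paraphrase is only credited for the shared words:
print(rouge_n_recall("the film was excellent", "the movie was great"))  # → 0.5
```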
HOLMS: Alternative Summary Evaluation with Large Language Models
This paper presents a new hybrid evaluation measure for summarization, called HOLMS, that combines both language models pre-trained on large corpora and lexical similarity measures and shows that HOLMS outperforms ROUGE and BLEU substantially in its correlation with human judgments on several extractive summarization datasets.
The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization
The experimental results show that the max value over each dimension of the summary ELMo word embeddings is a good representation that correlates highly with human ratings, and that averaging the cosine similarity of all encoders the authors tested yields high correlation with manual scores in the reference-free setting.
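The max-over-dimensions scheme described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the hard-coded 3-dimensional vectors are hypothetical stand-ins for real ELMo word embeddings.

```python
import math

def max_pool(word_vectors):
    """Take the max over each dimension of the word embeddings,
    producing a single fixed-size summary vector."""
    return [max(dims) for dims in zip(*word_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional "embeddings" for two short summaries:
system_summary = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
reference_summary = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]
score = cosine(max_pool(system_summary), max_pool(reference_summary))
```

In the reference-free setting the same comparison would be made against the source document's pooled embedding instead of a reference summary.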
Towards Guided Summarization of Scientific Articles: Selection of Important Update Sentences
This research addresses the selection of important update sentences, one of the steps in guided summarization for the domain of scientific articles, and employs and compares selection algorithms such as Maximum Marginal Relevance (MMR) and TextRank.
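Greedy MMR selection, one of the algorithms named above, can be sketched as below. This is a generic illustration, not the paper's system; word-overlap (Jaccard) similarity is an assumed stand-in for whatever relevance measure an actual implementation would use.

```python
def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance: each round picks the sentence
    that best trades off query relevance (weight lam) against redundancy
    with the sentences already selected (weight 1 - lam)."""
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            relevance = jaccard(s, query)
            redundancy = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

sentences = [
    "deep learning improves summarization",
    "deep learning improves summarization quality",
    "weather is sunny today",
]
picked = mmr_select(sentences, "deep learning summarization", k=2, lam=0.3)
```

With a low `lam`, the redundancy penalty dominates: after the most relevant sentence is chosen, its near-duplicate is skipped in favor of a dissimilar one.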
Facet-Aware Evaluation for Extractive Summarization
This paper demonstrates that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights of state-of-the-art summarization methods.
Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
This work proposes crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations for assessing the intrinsic and extrinsic quality of summarization, by comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BERTScore on a German summarization data set.
Text document summarization using word embedding
This paper proposes an automatic summarizer that uses a distributional semantic model to capture semantics for producing high-quality summaries, and concludes that using semantics as a feature for text summarization improves results and helps to further reduce redundancies from the input source.
InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation
This paper introduces InfoLM, a family of untrained metrics that can be viewed as string-based metrics which address the aforementioned flaws thanks to a pre-trained masked language model, and which make use of information measures allowing InfoLM to be adapted to various evaluation criteria.
SCE-SUMMARY at the FNS 2020 shared task
With the constantly growing amount of information, the need arises to automatically summarize this written information. One of the challenges in summarization is that it is difficult to generalize. For
Measuring Similarity of Opinion-bearing Sentences
For many NLP applications of online reviews, comparison of two opinion-bearing sentences is key. We argue that, while general purpose text similarity metrics have been applied for this purpose, there
Abstractive Summarization Using Attentive Neural Techniques
This work modifies and optimizes a translation model with self-attention for generating abstractive sentence summaries, and proposes a new approach based on the intuition that an abstractive model requires an abstractive evaluation.


ROUGE: A Package for Automatic Evaluation of Summaries
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
Overview of the TAC 2008 Update Summarization Task
While all of the 71 submitted runs were automatically scored with the ROUGE and BE metrics, NIST assessors manually evaluated only 57 of the submitted runs for readability, content, and overall responsiveness.
Overview of DUC 2005
The focus of DUC 2005 was on developing new evaluation methods that take into account variation in content in human-authored summaries. Therefore, DUC 2005 had a single user-oriented,
Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.
ROUGE Perl Implementation Download