A Graph-theoretic Summary Evaluation for ROUGE

@inproceedings{Shafieibavani2018AGS,
  title={A Graph-theoretic Summary Evaluation for ROUGE},
  author={Elaheh Shafieibavani and Mohammad Ebrahimi and Raymond K. Wong and Fang Chen},
  booktitle={EMNLP},
  year={2018}
}
ROUGE is one of the first and most widely used evaluation metrics for text summarization. [...] Experimental results over the TAC AESOP datasets show that exploiting the lexico-semantic similarity of the words used in summaries significantly helps ROUGE correlate better with human judgments.
The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization
TLDR
The experimental results show that the max value over each dimension of the summary ELMo word embeddings is a good representation that results in high correlation with human ratings, and that averaging the cosine similarity of all encoders the authors tested yields high correlation with manual scores in the reference-free setting.
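The max-pooling-plus-cosine recipe described above can be sketched as follows. This is a minimal NumPy illustration with toy vectors standing in for ELMo embeddings; `max_pool` and `cosine` are hypothetical helper names, not the paper's implementation:

```python
import numpy as np

def max_pool(word_embeddings):
    """Max over each dimension of a summary's word embeddings (n_words x dim)."""
    return word_embeddings.max(axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for contextual embeddings of a system summary and a reference.
system_emb = np.array([[0.2, 0.9, 0.1], [0.8, 0.1, 0.4]])
reference_emb = np.array([[0.3, 0.8, 0.2], [0.7, 0.2, 0.5]])

score = cosine(max_pool(system_emb), max_pool(reference_emb))
```

Each summary is collapsed to a single vector by the element-wise max, and the two vectors are compared by cosine, so the score lies in [-1, 1].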
An Anchor-Based Automatic Evaluation Metric for Document Summarization
TLDR
A new protocol for designing reference-based metrics that require the endorsement of source document(s) is considered, and an anchored ROUGE metric that fixes each summary particle on the source document, basing the computation on more solid ground, is proposed.
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
TLDR
This work proposes SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques.
Facet-Aware Evaluation for Extractive Summarization
TLDR
This paper demonstrates that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights of state-of-the-art summarization methods.
GRUEN for Evaluating Linguistic Quality of Generated Text
TLDR
Experiments show that the proposed GRUEN metric correlates highly with human judgments, and has the advantage of being unsupervised, deterministic, and adaptable to various tasks.
Neural Text Summarization: A Critical Evaluation
TLDR
This work critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlights three primary shortcomings: automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.
Context-Aware Model of Abstractive Text Summarization for Research Articles
A research article comprises different sections, each holding its own characteristic domain information. Summarization of an entire article from multiple documents of multiple sections in precise form
Multi-document Summarization via Deep Learning Techniques: A Survey
TLDR
This survey structurally overviews the recent deep learning based multi-document summarization models via a proposed taxonomy and proposes a novel mechanism to summarize the design strategies of neural networks and conduct a comprehensive summary of the state-of-the-art.
GU-ISS-2019-03 Assessing the quality of Språkbanken's annotations
Most of the corpora in Språkbanken Text consist of unannotated plain text, such as almost all newspaper texts, social media texts, novels and official documents. We also have some corpora that are
A critical analysis of metrics used for measuring progress in artificial intelligence
TLDR
The results suggest that the large majority of metrics currently used to evaluate classification AI benchmark tasks have properties that may result in an inadequate reflection of a classifiers' performance, especially when used with imbalanced datasets.

References

Showing 1-10 of 24 references
A Semantically Motivated Approach to Compute ROUGE Scores
TLDR
A graph-based algorithm is adopted into ROUGE to capture the semantic similarities between peer and model summaries; results indicate that exploiting the lexico-semantic similarity of the words used in summaries would significantly help ROUGE correlate better with human judgments.
Better Summarization Evaluation with Word Embeddings for ROUGE
TLDR
This proposal uses word embeddings to overcome the disadvantage of ROUGE in the evaluation of abstractive summarization, or summaries with substantial paraphrasing, by using them to compute the semantic similarity of the words used in summaries instead.
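The embedding-based matching this proposal describes can be approximated as a soft unigram recall, sketched below under the assumption that each reference word is credited with its best cosine match among system words. `soft_rouge1_recall` is an illustrative name, not the paper's implementation:

```python
import numpy as np

def soft_rouge1_recall(ref_vecs, sys_vecs):
    """Soft unigram recall: each reference word is credited with its most
    similar system word by cosine, instead of requiring an exact string match."""
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sys = sys_vecs / np.linalg.norm(sys_vecs, axis=1, keepdims=True)
    sim = ref @ sys.T                     # pairwise cosine similarity matrix
    return float(sim.max(axis=1).mean())  # best system match per reference word

# Toy stand-ins for word embeddings of a reference and a system summary.
ref_vecs = np.array([[1.0, 0.0], [0.6, 0.8]])
sys_vecs = np.array([[0.9, 0.1], [0.0, 1.0]])
score = soft_rouge1_recall(ref_vecs, sys_vecs)
```

Because a paraphrase with a nearby embedding still earns partial credit, this relaxation rewards semantically similar word choices that exact n-gram overlap would miss.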
ROUGE: A Package for Automatic Evaluation of Summaries
TLDR
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
Looking for a Few Good Metrics: ROUGE and its Evaluation
TLDR
The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five different ROUGE metrics included in the ROUGE summarization evaluation package are evaluated using data provided by DUC.
Summarization Evaluation in the Absence of Human Model Summaries Using the Compositionality of Word Embeddings
TLDR
The proposed metric is evaluated in replicating the human assigned scores for summarization systems and summaries on data from query-focused and update summarization tasks in TAC 2008 and 2009.
Summarization system evaluation revisited: N-gram graphs
TLDR
A novel automatic method for the evaluation of summarization systems, based on comparing the character n-gram graph representations of the extracted summaries and a number of model summaries, which appears to hold a level of evaluation performance that matches and even exceeds that of other contemporary evaluation methods.
Evaluation Measures for Text Summarization
TLDR
A new evaluation measure for assessing the quality of a summary is proposed that can compare a summary with its full text; if abstracts are not available for a given corpus, the LSA-based measure is an appropriate choice.
Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE
TLDR
An analysis of current evaluation methodologies applied to summarization metrics reveals for the first time which metric variants significantly outperform others, identifies optimal metric variants distinct from the currently recommended best variants, and shows the machine translation metric BLEU to have performance on par with ROUGE for the purpose of evaluating summarization systems.
Automated Summarization Evaluation with Basic Elements
TLDR
This paper describes a framework in which summary evaluation measures can be instantiated and compared, and implements a specific evaluation method using very small units of content, called Basic Elements, that address some of the shortcomings of n-grams.
CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure
We present CoSimRank, a graph-theoretic similarity measure that is efficient because it can compute a single node similarity without having to compute the similarities of the entire graph. We present
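The single-pair computation that makes CoSimRank efficient can be sketched as a damped sum of inner products of iterated random-walk distributions started at the two nodes. This is a minimal illustration of the published formulation on a toy graph, not the authors' code:

```python
import numpy as np

def cosimrank(A, i, j, c=0.8, iters=10):
    """CoSimRank for a single node pair (i, j): a damped sum of inner
    products of random-walk distributions started at i and at j."""
    P = A / A.sum(axis=0, keepdims=True)  # column-stochastic transition matrix
    pi = np.eye(A.shape[0])[i]            # walk distribution started at i
    pj = np.eye(A.shape[0])[j]            # walk distribution started at j
    score, damp = 0.0, 1.0
    for _ in range(iters):
        score += damp * (pi @ pj)
        pi, pj = P @ pi, P @ pj           # advance both walks one step
        damp *= c                         # damp longer walks
    return score

# Toy 3-node path graph a-b-c (hypothetical; any adjacency matrix works).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
```

Only the two walk vectors for the queried pair are updated, which is what lets a single node similarity be computed without materializing similarities for the entire graph.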