Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics

@inproceedings{Lin2003AutomaticEO,
  title={Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics},
  author={Chin-Yew Lin and Eduard H. Hovy},
  booktitle={NAACL},
  year={2003}
}
Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations across various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
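
As a rough sketch of the unigram co-occurrence scoring studied in the paper (a recall-oriented overlap against reference summaries), the following is illustrative only; the tokenizer and function names are assumptions, not the authors' implementation.

```python
from collections import Counter

def unigram_cooccurrence_recall(candidate: str, references: list[str]) -> float:
    """Clipped unigram overlap between a candidate summary and reference summaries,
    normalized by the total number of reference unigrams (recall-oriented)."""
    cand_counts = Counter(candidate.lower().split())
    total_matches = 0
    total_ref_unigrams = 0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        # Each reference unigram can be matched at most as often as it
        # appears in the candidate (clipped counts).
        total_matches += sum(min(cand_counts[w], c) for w, c in ref_counts.items())
        total_ref_unigrams += sum(ref_counts.values())
    return total_matches / total_ref_unigrams if total_ref_unigrams else 0.0

# Example: 5 of the 7 reference unigrams are covered, so the score is ~0.71.
print(unigram_cooccurrence_recall(
    "the cat sat on the mat",
    ["the cat was sitting on the mat"]))
```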

Citations

A Unified Framework For Automatic Evaluation Using 4-Gram Co-occurrence Statistics
TLDR
A unified framework for automatic evaluation of NLP applications using N-gram co-occurrence statistics is proposed, showing that, depending on the application being evaluated, different members of the same family of metrics best explain the variations observed in human evaluations.
Analysis of Automated Evaluation for Multi-document Summarization Using Content-Based Similarity
Li-qing Qiu, Bin Pang · Second International Conference on the Digital Society · 2008
TLDR
An automated evaluation method based on content similarity is introduced, and a vector space of words is constructed, on which the cosine similarity of automated summaries and human summaries is computed.
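
A minimal sketch of the content-based similarity idea described above, assuming a simple bag-of-words vector space; the tokenization and function name are illustrative, not the authors' code.

```python
import math
from collections import Counter

def cosine_similarity(summary_a: str, summary_b: str) -> float:
    """Cosine similarity between the word-count vectors of two summaries."""
    vec_a = Counter(summary_a.lower().split())
    vec_b = Counter(summary_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```
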
An Extensive Empirical Study of Automated Evaluation of Multi-Document Summarization
TLDR
An approach to automated evaluation of multi-document summarization is discussed, in which automated summaries are scored by their similarity to human-written reference summaries.
Evaluating N-gram based Evaluation Metrics for Automatic Keyphrase Extraction
TLDR
This paper describes a feasibility study of n-gram-based evaluation metrics for automatic keyphrase extraction, adapting various evaluation metrics developed for machine translation and summarization as well as the R-precision metric from keyphrase evaluation.
An Information-Theoretic Approach to Automatic Evaluation of Summaries
TLDR
This paper introduces an information-theoretic approach to automatic evaluation of summaries based on the Jensen-Shannon divergence between the word distributions of an automatic summary and a set of reference summaries; results indicate that the JS divergence-based method achieves performance comparable to the common automatic evaluation method ROUGE on the single-document summarization task.
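
A hedged sketch of the information-theoretic idea: score a system summary by the Jensen-Shannon divergence between its word distribution and that of the pooled reference summaries (lower is better). Names and tokenization are assumptions for illustration.

```python
import math
from collections import Counter

def _distribution(tokens: list[str]) -> dict[str, float]:
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(system_summary: str, references: list[str]) -> float:
    """Jensen-Shannon divergence between system and reference word distributions."""
    p = _distribution(system_summary.lower().split())
    q = _distribution(" ".join(references).lower().split())
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a: dict[str, float]) -> float:
        # KL divergence from a to the mixture m; terms with a(w) = 0 contribute 0.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```
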
Re-using High-quality Resources for Continued Evaluation of Automated Summarization Systems
TLDR
This approach enhances the standard n-gram-based evaluation of automatic summarization systems by establishing similarities between extractive (vs. abstractive) summaries and by taking advantage of the large quantity of evaluated summaries available from the DUC contests.
Kernel-based Approach for Automatic Evaluation of Natural Language Generation Technologies: Application to Automatic Summarization
TLDR
An evaluation method based on convolution kernels, which measure the similarity between texts by considering their substructures, is presented; it correlates more closely with human evaluations and is more robust.
Evaluating Automatic Summaries of Meeting Recordings
TLDR
This research explores schemes for evaluating automatic summaries of business meetings using the ICSI Meeting Corpus, with a central interest in whether the two types of evaluation correlate with each other.
The significance of recall in automatic metrics for MT evaluation
TLDR
This work shows that correlation with human judgments is highest when almost all of the weight is assigned to recall, and shows that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
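
A small sketch of the recall-weighted scoring that this finding motivates: a weighted harmonic mean of unigram precision and recall in which alpha close to 1 places almost all of the weight on recall. The parameter name and default value are illustrative assumptions.

```python
def weighted_f_measure(precision: float, recall: float, alpha: float = 0.9) -> float:
    """Weighted harmonic mean of precision and recall.

    With alpha -> 1 the score approaches recall; with alpha -> 0 it approaches
    precision. Values of alpha near 1 reflect the recall-heavy weighting that
    correlated best with human judgments in this line of work.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)
```
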
Vocabulary Usage in Newswire Summaries
Analysis of 9000 manually written summaries of newswire stories used in four Document Understanding Conferences indicates that approximately 40% of their lexical items do not occur in the source documents.

References

Manual and automatic evaluation of summaries
TLDR
This paper shows the instability of the manual evaluation of summaries, and investigates the feasibility of automated summary evaluation based on the recent BLEU method from machine translation using accumulative n-gram overlap scores between system and human summaries.
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
TLDR
NIST was commissioned to develop an MT evaluation facility based on the IBM work; this facility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
A Comparison of Rankings Produced by Summarization Evaluation Measures
TLDR
This paper proposes using sentence-rank-based and content-based measures for evaluating extract summaries, and compares these with recall-based evaluation measures.
The TIPSTER SUMMAC Text Summarization Evaluation
The TIPSTER Text Summarization Evaluation (SUMMAC) has established definitively that automatic text summarization is very effective in relevance assessment tasks. Summaries as short as 17% of full text length sped up decision making by almost a factor of two with no statistically significant degradation in accuracy.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
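
For contrast with the recall-oriented summary scoring above, here is a condensed sketch of the BLEU idea: clipped (modified) n-gram precisions combined by a geometric mean and scaled by a brevity penalty. This is an illustration under simplifying assumptions (whitespace tokenization, no smoothing), not the official BLEU implementation.

```python
import math
from collections import Counter

def _ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, references: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram precisions
    times a brevity penalty. Returns 0.0 if any precision is zero (no smoothing)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = _ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in _ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        precision = clipped / sum(cand_ngrams.values())
        if precision == 0.0:
            return 0.0
        log_prec_sum += math.log(precision) / max_n
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_prec_sum)
```
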
The formation of abstracts by the selection of sentences
TLDR
There was very little agreement between the subjects and the machine methods in their selection of representative sentences, and human selection of sentences was considerably more variable than the machine methods.
Tracking and summarizing news on a daily basis with Columbia's Newsblaster
TLDR
Columbia's Newsblaster system for online news summarization is presented, a system that crawls the web for news articles, clusters them on specific topics and produces multidocument summaries for each cluster.
NewsInEssence: A System For Domain-Independent, Real-Time News Clustering and Multi-Document Summarization
TLDR
A system is presented for finding, visualizing, and summarizing a topic-based cluster of news stories, producing summaries of a subset of the stories it finds according to parameters specified by the user.
Evaluating Natural Language Processing Systems: An Analysis and Review
TLDR
This comprehensive state-of-the-art book is the first devoted to the important and timely issue of evaluating NLP systems, and provides a wide-ranging and careful analysis of evaluation concepts, reinforced with extensive illustrations.
Text summarization challenge 2: text summarization evaluation at NTCIR workshop 3
We describe the outline of Text Summarization Challenge 2 (TSC2 hereafter), a sequel text summarization evaluation conducted as one of the tasks at the NTCIR Workshop 3.