The Pyramid Method: Incorporating human content selection variation in summarization evaluation

@article{Nenkova2007ThePM,
  title={The Pyramid Method: Incorporating human content selection variation in summarization evaluation},
  author={A. Nenkova and R. Passonneau and K. McKeown},
  journal={ACM Trans. Speech Lang. Process.},
  year={2007},
  volume={4},
  pages={4}
}
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? [...] It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and is predictive of different, equally informative summaries. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation…
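As a concrete illustration, here is a minimal sketch of the original pyramid score: the summed weight of the Summary Content Units (SCUs) a peer summary expresses, divided by the maximum weight an ideally informative summary with the same number of SCUs could attain. The SCU identifiers and weights below are invented, and the full method also defines a modified score normalized by the average SCU count of the model summaries, which is not shown here.

def pyramid_score(pyramid_weights, peer_scus):
    """Original pyramid score: observed SCU weight over the best achievable
    weight for a summary containing the same number of SCUs."""
    peer_scus = set(peer_scus)
    observed = sum(pyramid_weights.get(scu, 0) for scu in peer_scus)
    # An ideal summary of the same size draws its SCUs from the top tiers
    # of the pyramid, i.e. the highest-weighted SCUs.
    best = sum(sorted(pyramid_weights.values(), reverse=True)[:len(peer_scus)])
    return observed / best if best else 0.0

# Toy pyramid built from four model summaries: each weight is the number of
# model summaries expressing that SCU (invented values).
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 2, "scu5": 1}
print(pyramid_score(weights, {"scu1", "scu3", "scu5"}))  # 7 / 9 ≈ 0.78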
Automatically Evaluating Content Selection in Summarization without Human Models
TLDR
This work capitalizes on the assumption that the distribution of words in the input and an informative summary of that input should be similar to each other, and ranks participating systems similarly to manual model-based pyramid evaluation and to manual human judgments of responsiveness.
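The input–summary comparison idea can be sketched with a Jensen–Shannon divergence between word distributions, one of the divergences explored in this line of work; the vocabulary handling and lack of smoothing below are simplifying assumptions, not the exact setup of the cited paper.

import math
from collections import Counter

def word_distribution(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    return {w: counts[w] / total for w in vocab}

def js_divergence(p, q, vocab):
    # Jensen-Shannon divergence: symmetric, finite, and small when the
    # summary's word distribution resembles the input's.
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

input_text = "the senate passed the budget bill after a long debate"
summary = "senate passed budget bill"
vocab = set(input_text.split()) | set(summary.split())
print(js_divergence(word_distribution(input_text, vocab),
                    word_distribution(summary, vocab), vocab))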
PEAK: Pyramid Evaluation via Automated Knowledge Extraction
TLDR
PEAK is proposed, the first method to automatically assess summary content using the pyramid method that also generates the pyramid content models, and relies on open information extraction and graph algorithms.
Learning to Score System Summaries for Better Content Selection Evaluation.
TLDR
This work proposes to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets like TAC-2008 and TAC-2009, and releases the trained metric as an open-source tool.
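A minimal sketch of such a learning setup: regress features of a system summary onto the available human scores. The feature values, the regressor, and the score scale below are invented placeholders, not the features or model of the cited work.

import numpy as np
from sklearn.linear_model import Ridge

# Each row holds features extracted for one system summary (e.g. overlap
# statistics against references); all values here are made up.
X_train = np.array([
    [0.42, 0.18, 0.30],
    [0.10, 0.05, 0.08],
    [0.55, 0.25, 0.40],
    [0.30, 0.12, 0.22],
])
# Human judgments (e.g. pyramid or responsiveness scores) for those summaries.
y_train = np.array([3.8, 1.2, 4.5, 2.9])

metric = Ridge(alpha=1.0).fit(X_train, y_train)
print(metric.predict(np.array([[0.35, 0.15, 0.25]])))  # learned quality estimate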
Learning Summary Content Units with Topic Modeling
TLDR
The results show that the topic model identifies topic-sentence associations that correspond to the contributors of SCUs, suggesting that the topic modeling approach can generate a viable set of candidate SCUs for facilitating the creation of Pyramids.
Content selection in multi-document summarization
TLDR
It is shown that a modular extractive summarizer using estimates of word importance can generate summaries comparable to state-of-the-art systems, and a new framework of system combination for multi-document summarization is presented.
Pyramid-based Summary Evaluation Using Abstract Meaning Representation
TLDR
The proposed metric complements well the widely-used ROUGE metrics and automatizes the evaluation process, which does not need any manual intervention on the evaluated summary side.
Automatic Summary Evaluation without Human Models
TLDR
These results on a large-scale evaluation from the Text Analysis Conference show that input-summary comparisons can be very effective and can be used to rank participating systems very similarly to manual model-based evaluations as well as to manual human judgments of summary quality without reference to a model.
Revisiting Summarization Evaluation for Scientific Articles
TLDR
It is shown that, contrary to common belief, ROUGE is not very reliable for evaluating scientific summaries, and an alternative metric is proposed which is based on the content relevance between a system-generated summary and the corresponding human-written summaries.
PyrEval: An Automated Method for Summary Content Analysis
TLDR
PyrEval automates the manual pyramid method, using low-dimension distributional semantics to represent phrase meanings and a new algorithm, EDUA (Emergent Discovery of Units of Attraction), which solves a set cover problem to construct the content model from vectorized phrases.
A Simple Theoretical Model of Importance for Summarization
TLDR
It is argued that establishing theoretical models of Importance will advance the understanding of the task and help to further improve summarization systems, and simple but rigorous definitions are proposed for several concepts that were previously used only intuitively in summarization: Redundancy, Relevance, and Informativeness.

References

Evaluating Content Selection in Summarization: The Pyramid Method
TLDR
It is argued that the method presented is reliable, predictive and diagnostic, and thus improves considerably over the shortcomings of the human evaluation method currently used in the Document Understanding Conference.
Summarization Evaluation Methods: Experiments and Analysis
TLDR
The results show that different parameters of an experiment can affect how well a system scores; the paper describes how these parameters can be controlled to produce a sound evaluation.
Summarizing text documents: sentence selection and evaluation metrics
TLDR
An analysis of news-article summaries generated by sentence selection, using a normalized version of precision-recall curves with a baseline of random sentence selection to evaluate features; empirical results show the importance of corpus-dependent baseline summarization standards, compression ratios, and carefully crafted long queries.
Evaluating Summaries and Answers: Two Sides of the Same Coin?
TLDR
It is argued that question answering and multi-document summarization represent two complementary approaches to the same problem of satisfying complex user information needs, with a focus on the implications for system evaluation.
Single-document and multi-document summary evaluation using Relative Utility
TLDR
This work presents a series of experiments to demonstrate the validity of Relative Utility (RU) as a measure for evaluating extractive summarization systems, and indicates that Relative Utility is a reasonable, and often superior, alternative to several common summary evaluation metrics.
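A simplified sketch of the Relative Utility idea: judges assign a utility to every sentence in the input, and an extract is scored against the best-scoring extract of the same length. The full metric includes further normalizations that are omitted here, and the sentence ids and utility values below are invented.

def relative_utility(judge_utils, selected):
    # Average each sentence's utility over the judges.
    sentences = judge_utils[0].keys()
    avg = {s: sum(j[s] for j in judge_utils) / len(judge_utils) for s in sentences}
    achieved = sum(avg[s] for s in selected)
    # Best achievable utility for an extract with the same number of sentences.
    best = sum(sorted(avg.values(), reverse=True)[:len(selected)])
    return achieved / best if best else 0.0

judges = [
    {"s1": 10, "s2": 7, "s3": 2, "s4": 5},
    {"s1": 9,  "s2": 8, "s3": 3, "s4": 4},
]
print(relative_utility(judges, {"s1", "s4"}))  # 14.0 / 17.0 ≈ 0.82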
Evaluating Information Content by Factoid Analysis: Human annotation and stability
TLDR
It is shown that factoid annotation is highly reproducible; a weighted factoid score is introduced, an estimate is given of how many summaries are required for stable system rankings, and it is shown that factoid scores cannot be sufficiently approximated by unigrams or the DUC information overlap measure.
Examining the consensus between human summaries: initial experiments with factoid analysis
We present a new approach to summary evaluation which combines two novel aspects, namely (a) content comparison between gold standard summary and system summary via factoids, a pseudo-semantic…
The Pyramid Method
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can s...
Applying the Pyramid Method in DUC 2005
TLDR
It is found that a modified pyramid score gave good results and would simplify peer annotation in the future; high score correlations between sets from different annotators and good inter-annotator agreement indicate that participants can perform annotation reliably.
ROUGE: A Package for Automatic Evaluation of Summaries
TLDR
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
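For reference, a minimal sketch of ROUGE-N recall against a single reference, using clipped n-gram counts; the released package additionally supports stemming, stopword removal, multi-reference jackknifing, and the other variants listed above.

from collections import Counter

def rouge_n(candidate, reference, n=2):
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Recall: reference n-grams also found in the candidate, with clipped counts.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))  # 3/5 = 0.6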