Formal and functional assessment of the pyramid method for summary content evaluation

  • Authors:
  • Rebecca J. Passonneau

  • Affiliations:
  • Center for Computational Learning Systems, Columbia University, NY 10115, USA. E-mail: becky@cs.columbia.edu

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2010

Abstract

Pyramid annotation makes it possible to evaluate the content of machine-generated (or human) summaries both quantitatively and qualitatively. Evaluation methods must prove themselves against the same measuring stick – evaluation – as other research methods. First, a formal assessment of pyramid data from the 2003 Document Understanding Conference (DUC) is presented; it addresses whether the form of annotation is reliable and whether score results are consistent across annotators. A combination of interannotator reliability measures for the two manual annotation phases (pyramid creation and annotation of system peer summaries against pyramid models) and significance tests of the similarity of system scores from distinct annotations produces highly reliable results. The most rigorous test compares peer system rankings produced from two independent sets of pyramid and peer annotations; the two sets yield essentially the same rankings. Three years of DUC data (2003, 2005, 2006) are then used to assess the reliability of the method across distinct evaluation settings: distinct systems, document sets, summary lengths, and numbers of model summaries. This functional assessment addresses the method's ability to discriminate among systems across years. Results indicate that the statistical power of the method is more than sufficient to identify statistically significant differences among systems, and that this power varies little across the three years.
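
To make the scoring and ranking comparison concrete, below is a minimal sketch (not the paper's code) in Python. It assumes the commonly cited "original" pyramid score definition: the sum of weights of the SCUs (summary content units) a peer summary expresses, divided by the maximum weight attainable by any summary expressing the same number of SCUs. It also shows one way rankings from two independent annotations might be compared, using Spearman rank correlation; the SCU weights, system scores, and function names are hypothetical examples, not data from the paper.

```python
# Minimal sketch: pyramid scoring and comparison of system rankings
# produced by two independent annotations. All data below is hypothetical.
from scipy.stats import spearmanr


def pyramid_score(scu_weights, expressed_scus):
    """scu_weights: mapping SCU id -> weight (number of model summaries
    containing that SCU); expressed_scus: SCU ids found in the peer summary."""
    observed = sum(scu_weights[s] for s in expressed_scus)
    # An ideally informative summary with the same number of SCUs would
    # express the heaviest SCUs first.
    ideal = sum(sorted(scu_weights.values(), reverse=True)[:len(expressed_scus)])
    return observed / ideal if ideal else 0.0


# Hypothetical pyramid built from four model summaries (weights 1..4).
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 2, "scu5": 1}
print(pyramid_score(weights, {"scu1", "scu3", "scu5"}))  # (4+2+1)/(4+3+2) = 0.78

# Comparing system rankings from two independent annotations: a high rank
# correlation indicates the two annotations rank the systems consistently.
scores_annotation_a = [0.61, 0.48, 0.55, 0.30]  # hypothetical per-system scores
scores_annotation_b = [0.58, 0.50, 0.53, 0.28]
rho, p = spearmanr(scores_annotation_a, scores_annotation_b)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```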