Pyramid annotation makes it possible to evaluate the content of machine-generated (or human) summaries both quantitatively and qualitatively. Like any other research method, an evaluation method must itself be evaluated. First, a formal assessment of pyramid data from the 2003 Document Understanding Conference (DUC) is presented, addressing whether the annotation is reliable and whether scores are consistent across annotators. A combination of interannotator reliability measures for the two manual annotation phases (pyramid creation and annotation of system peer summaries against pyramid models), together with significance tests of the similarity of system scores from distinct annotations, shows the method to be highly reliable. The most rigorous test compares the peer system rankings produced from two independent sets of pyramid and peer annotations; the two rankings are essentially the same. Second, three years of DUC data (2003, 2005, and 2006) are used to assess the method's reliability across distinct evaluation settings: different systems, document sets, summary lengths, and numbers of model summaries. This functional assessment addresses the method's ability to discriminate among systems across years. The results indicate that the statistical power of the method is more than sufficient to detect statistically significant differences among systems, and that this power varies little across the three years.
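To make the two steps concrete, the sketch below computes a pyramid score and then checks agreement between the system rankings induced by two independent annotations. It is a minimal illustration, assuming a pyramid represented as a map from summary content units (SCUs) to weights (the number of model summaries expressing each SCU); the SCU names, the example systems, the second annotator's scores, and the choice of Kendall's tau as the rank-agreement statistic are illustrative assumptions, not the paper's exact procedure.

    # Minimal sketch of pyramid scoring and ranking comparison.
    # All SCU ids, weights, and scores below are hypothetical.
    from scipy.stats import kendalltau

    def pyramid_score(pyramid_weights, matched_scus):
        """Pyramid score: total weight of SCUs expressed in the peer
        summary, divided by the maximum weight attainable with the
        same number of SCUs drawn from the pyramid."""
        observed = sum(pyramid_weights[scu] for scu in matched_scus)
        # Best case: the k heaviest SCUs, where k = number matched.
        top_k = sorted(pyramid_weights.values(),
                       reverse=True)[:len(matched_scus)]
        max_attainable = sum(top_k)
        return observed / max_attainable if max_attainable else 0.0

    # Hypothetical pyramid: SCU -> weight.
    pyramid = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1, "scu5": 1}

    # Scores for three hypothetical systems under annotation A ...
    annotation_a = [pyramid_score(pyramid, {"scu1", "scu2"}),  # system 1
                    pyramid_score(pyramid, {"scu2", "scu4"}),  # system 2
                    pyramid_score(pyramid, {"scu3", "scu5"})]  # system 3
    # ... and stand-in scores from an independent annotation B.
    annotation_b = [0.95, 0.60, 0.40]

    # Agreement of the two induced system rankings. With only three
    # toy systems the p-value is uninformative; the paper's claim is
    # that real DUC data gives ample power to separate systems.
    tau, p_value = kendalltau(annotation_a, annotation_b)
    print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")

Under this scoring, a peer summary expressing the k heaviest SCUs scores 1.0; the ranking comparison mirrors the paper's most rigorous test, in which rankings derived from two independent sets of pyramid and peer annotations turned out to be essentially identical.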