Automatic evaluation has greatly facilitated system development in summarization. At the same time, automatic evaluation is viewed with mistrust by many, because its accuracy and correct application are not well understood. In this paper we provide an assessment of the automatic evaluations used for multi-document summarization of news. We outline our recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems. We identify the reference automatic evaluation metrics, ROUGE-1 and ROUGE-2, that appear to best emulate human pyramid and responsiveness scores on four years of NIST evaluations. We then demonstrate the accuracy of these metrics in reproducing human judgments about the relative content quality of pairs of systems, and we present an empirical assessment of the relationship between statistically significant differences between systems according to manual evaluations and the corresponding differences according to automatic evaluations. Finally, we present a case study of how new metrics should be compared to the reference evaluation, as we search for even more accurate automatic measures.
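A minimal sketch of the comparison workflow described above, assuming Python and toy data: each system summary is scored with ROUGE-n recall against the model summaries, and a paired bootstrap test over documents then checks whether the observed difference between two systems is statistically significant. The function names and the toy inputs are illustrative assumptions, not the evaluation code used in the NIST assessments.

    import random
    from collections import Counter

    def ngrams(tokens, n):
        # Multiset of word n-grams in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate, references, n):
        # ROUGE-n recall: clipped n-gram matches / total n-grams in the references.
        cand = ngrams(candidate.split(), n)
        matched = total = 0
        for ref in references:
            ref_counts = ngrams(ref.split(), n)
            matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
            total += sum(ref_counts.values())
        return matched / total if total else 0.0

    def paired_bootstrap_pvalue(scores_a, scores_b, samples=10000, seed=0):
        # Resample documents with replacement and estimate how often system B
        # matches or beats system A, i.e. an approximate one-sided p-value.
        rng = random.Random(seed)
        indices = list(range(len(scores_a)))
        ties_or_losses = 0
        for _ in range(samples):
            draw = rng.choices(indices, k=len(indices))
            if sum(scores_a[i] - scores_b[i] for i in draw) <= 0:
                ties_or_losses += 1
        return ties_or_losses / samples

    # Toy usage: per-document ROUGE-2 recall for two hypothetical systems.
    references = [["the cat sat on the mat"], ["the dogs bark at night"]]
    system_a = ["the cat sat on a mat", "the dogs bark loudly at night"]
    system_b = ["a cat is on the mat", "dogs are loud"]
    a = [rouge_n_recall(s, r, 2) for s, r in zip(system_a, references)]
    b = [rouge_n_recall(s, r, 2) for s, r in zip(system_b, references)]
    print(a, b, paired_bootstrap_pvalue(a, b))

In practice, significance would be assessed over the full set of evaluation topics with many bootstrap samples; a Wilcoxon signed-rank test over per-topic scores is a common alternative to the bootstrap shown here.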