Automatic evaluation has greatly facilitated system development in summarization. At the same time, automatic evaluation is viewed with mistrust by many, because its accuracy and correct application are not well understood. In this paper we provide an assessment of the automatic evaluations used for multi-document summarization of news. We outline our recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems. We identify the reference automatic evaluation metrics, ROUGE-1 and ROUGE-2, that appear to best emulate human pyramid and responsiveness scores on four years of NIST evaluations. We then demonstrate the accuracy of these metrics in reproducing human judgments about the relative content quality of pairs of systems, and we present an empirical assessment of the relationship between statistically significant differences between systems according to manual evaluations and the corresponding differences according to automatic evaluations. Finally, we present a case study of how new metrics should be compared to the reference evaluation, as we search for even more accurate automatic measures.
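A minimal sketch of the comparison workflow described above, assuming Python and toy data: each system summary is scored with ROUGE-n recall against the model summaries, and a paired bootstrap test over documents then checks whether the observed difference between two systems is statistically significant. The function names and the toy inputs are illustrative assumptions, not the evaluation code used in the NIST assessments.

    import random
    from collections import Counter

    def ngrams(tokens, n):
        # Multiset of word n-grams in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate, references, n):
        # ROUGE-n recall: clipped n-gram matches / total n-grams in the references.
        cand = ngrams(candidate.split(), n)
        matched = total = 0
        for ref in references:
            ref_counts = ngrams(ref.split(), n)
            matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
            total += sum(ref_counts.values())
        return matched / total if total else 0.0

    def paired_bootstrap_pvalue(scores_a, scores_b, samples=10000, seed=0):
        # Resample documents with replacement and estimate how often system B
        # matches or beats system A, i.e. an approximate one-sided p-value.
        rng = random.Random(seed)
        indices = list(range(len(scores_a)))
        ties_or_losses = 0
        for _ in range(samples):
            draw = rng.choices(indices, k=len(indices))
            if sum(scores_a[i] - scores_b[i] for i in draw) <= 0:
                ties_or_losses += 1
        return ties_or_losses / samples

    # Toy usage: per-document ROUGE-2 recall for two hypothetical systems.
    references = [["the cat sat on the mat"], ["the dogs bark at night"]]
    system_a = ["the cat sat on a mat", "the dogs bark loudly at night"]
    system_b = ["a cat is on the mat", "dogs are loud"]
    a = [rouge_n_recall(s, r, 2) for s, r in zip(system_a, references)]
    b = [rouge_n_recall(s, r, 2) for s, r in zip(system_b, references)]
    print(a, b, paired_bootstrap_pvalue(a, b))

In practice, significance would be assessed over the full set of evaluation topics with many bootstrap samples; a Wilcoxon signed-rank test over per-topic scores is a common alternative to the bootstrap shown here.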