In the evaluation of automatic summaries, multiple topics and multiple human-produced model summaries are needed for the assessment to be stable and reliable. However, providing many topics and models is costly and time-consuming. This paper examines the relation between the number of available models and topics and the correlation with human judgment achieved by the automatic metrics ROUGE and Basic Elements (BE), as well as by the manual Pyramid method. Testing all of these methods on the same data set, taken from the TAC 2008 Summarization track, allows us to compare and contrast them under different conditions.
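A minimal sketch of the kind of analysis the abstract describes, assuming hypothetical data: systems are scored on randomly chosen subsets of topics of varying size, and the resulting per-system averages are correlated with human judgments. The data, sizes, and the choice of Pearson correlation here are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the authors' code): how does the correlation
# between an automatic metric and human judgments change as the number
# of topics used for averaging shrinks?  All data below are synthetic.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

n_systems, n_topics = 50, 48  # assumed scale, roughly TAC-like
# Hypothetical per-topic metric scores: rows = systems, columns = topics.
metric_scores = rng.random((n_systems, n_topics))
# Hypothetical human judgments, loosely related to the metric scores.
human_scores = metric_scores.mean(axis=1) + rng.normal(0, 0.05, n_systems)

def correlation_for_subset_size(k, n_samples=100):
    """Mean Pearson correlation with human judgments when each system
    is scored on only k randomly chosen topics."""
    correlations = []
    for _ in range(n_samples):
        topics = rng.choice(n_topics, size=k, replace=False)
        subset_means = metric_scores[:, topics].mean(axis=1)
        correlations.append(pearsonr(subset_means, human_scores)[0])
    return float(np.mean(correlations))

for k in (4, 8, 16, 32, 48):
    print(f"{k:2d} topics -> mean correlation {correlation_for_subset_size(k):.3f}")
```

The same loop could be repeated over subsets of model summaries instead of topics, which mirrors the second dimension the paper varies.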