In the evaluation of automatic summaries, multiple topics and multiple human-produced model summaries are needed for the assessment to be stable and reliable. However, providing many topics and models is costly and time-consuming. This paper examines the relation between the number of available models and topics and the correlation with human judgment achieved by the automatic metrics ROUGE and Basic Elements (BE), as well as by the manual Pyramid method. Testing all of these methods on the same data set, taken from the TAC 2008 Summarization track, allows us to compare and contrast them under different conditions.
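A minimal sketch of the kind of analysis the abstract describes, assuming hypothetical data: systems are scored on randomly chosen subsets of topics of varying size, and the resulting per-system averages are correlated with human judgments. The data, sizes, and the choice of Pearson correlation here are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the authors' code): how does the correlation
# between an automatic metric and human judgments change as the number
# of topics used for averaging shrinks?  All data below are synthetic.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

n_systems, n_topics = 50, 48  # assumed scale, roughly TAC-like
# Hypothetical per-topic metric scores: rows = systems, columns = topics.
metric_scores = rng.random((n_systems, n_topics))
# Hypothetical human judgments, loosely related to the metric scores.
human_scores = metric_scores.mean(axis=1) + rng.normal(0, 0.05, n_systems)

def correlation_for_subset_size(k, n_samples=100):
    """Mean Pearson correlation with human judgments when each system
    is scored on only k randomly chosen topics."""
    correlations = []
    for _ in range(n_samples):
        topics = rng.choice(n_topics, size=k, replace=False)
        subset_means = metric_scores[:, topics].mean(axis=1)
        correlations.append(pearsonr(subset_means, human_scores)[0])
    return float(np.mean(correlations))

for k in (4, 8, 16, 32, 48):
    print(f"{k:2d} topics -> mean correlation {correlation_for_subset_size(k):.3f}")
```

The same loop could be repeated over subsets of model summaries instead of topics, which mirrors the second dimension the paper varies.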