Discrepancy between automatic and manual evaluation of summaries

  • Authors:
  • Shamima Mithun; Leila Kosseim; Prasad Perera

  • Affiliations:
  • Concordia University, Montreal, Quebec, Canada; Concordia University, Montreal, Quebec, Canada; Concordia University, Montreal, Quebec, Canada

  • Venue:
  • Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
  • Year:
  • 2012

Abstract

Today, automatic evaluation metrics such as ROUGE have become the de facto standard for evaluating automatic summarization systems. However, based on the DUC and TAC evaluation results, Conroy and Schlesinger (2008) and Dang and Owczarzak (2008) showed that the performance gap between human-generated and system-generated summaries is clearly visible in manual evaluations but is often not reflected in automatic evaluations based on ROUGE scores. In this paper, we present our own experiments comparing the results of manual versus automatic evaluations using our own text summarizer, BlogSum. We evaluated the content of BlogSum-generated summaries using ROUGE and compared the results with the original candidate list (OList). A t-test showed no significant difference between BlogSum-generated summaries and OList summaries. However, two manual evaluations of content using two different datasets show that BlogSum performed significantly better than OList. A manual evaluation of summary coherence also shows that BlogSum performs significantly better than OList. These results agree with previous work and show the need for a better automated summary evaluation metric than the standard ROUGE metric.
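
The sketch below illustrates the kind of comparison the abstract describes: scoring two sets of system summaries with ROUGE against reference summaries and testing whether their score difference is statistically significant with a paired t-test. This is not the authors' evaluation code; it is a minimal illustration, and the inputs (blogsum_summaries, olist_summaries, references) are hypothetical placeholders for the BlogSum output, the original candidate list, and the gold-standard summaries.

```python
# Minimal sketch of a ROUGE + paired t-test comparison between two systems.
# Requires the `rouge-score` and `scipy` packages.
from rouge_score import rouge_scorer
from scipy import stats


def rouge2_f(summaries, references):
    """Per-document ROUGE-2 F-scores for a list of summaries."""
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    return [
        scorer.score(ref, summ)["rouge2"].fmeasure
        for summ, ref in zip(summaries, references)
    ]


def compare_systems(blogsum_summaries, olist_summaries, references, alpha=0.05):
    """Paired t-test on ROUGE-2 F-scores of two systems over the same documents."""
    blogsum_scores = rouge2_f(blogsum_summaries, references)
    olist_scores = rouge2_f(olist_summaries, references)
    t_stat, p_value = stats.ttest_rel(blogsum_scores, olist_scores)
    # The difference is considered significant when p_value < alpha.
    return t_stat, p_value, p_value < alpha
```

Under this setup, an automatic evaluation that finds no significant difference (p-value above the threshold) can still coexist with manual evaluations that rank one system clearly above the other, which is the discrepancy the paper reports.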