Discrepancy between automatic and manual evaluation of summaries

  • Authors:
  • Shamima Mithun; Leila Kosseim; Prasad Perera

  • Affiliations:
  • Concordia University, Montreal, Quebec, Canada; Concordia University, Montreal, Quebec, Canada; Concordia University, Montreal, Quebec, Canada

  • Venue:
  • Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
  • Year:
  • 2012

Abstract

Today, automatic evaluation metrics such as ROUGE have become the de facto standard for evaluating automatic summarization systems. However, based on the DUC and TAC evaluation results, Conroy and Schlesinger (2008) and Dang and Owczarzak (2008) showed that the performance gap between human-generated and system-generated summaries is clearly visible in manual evaluations but is often not reflected in automatic evaluations based on ROUGE scores. In this paper, we present our own experiments comparing the results of manual versus automatic evaluations using our own text summarizer, BlogSum. We evaluated the content of BlogSum-generated summaries using ROUGE and compared the results with the original candidate list (OList). A t-test showed no significant difference between BlogSum-generated summaries and OList summaries. However, two manual evaluations of content using two different datasets show that BlogSum performed significantly better than OList. A manual evaluation of summary coherence also shows that BlogSum performs significantly better than OList. These results agree with previous work and show the need for a better automated summary evaluation metric than the standard ROUGE metric.
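
The sketch below illustrates the kind of comparison the abstract describes: scoring two sets of system summaries with ROUGE against reference summaries and testing whether their score difference is statistically significant with a paired t-test. This is not the authors' evaluation code; it is a minimal illustration, and the inputs (blogsum_summaries, olist_summaries, references) are hypothetical placeholders for the BlogSum output, the original candidate list, and the gold-standard summaries.

```python
# Minimal sketch of a ROUGE + paired t-test comparison between two systems.
# Requires the `rouge-score` and `scipy` packages.
from rouge_score import rouge_scorer
from scipy import stats


def rouge2_f(summaries, references):
    """Per-document ROUGE-2 F-scores for a list of summaries."""
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    return [
        scorer.score(ref, summ)["rouge2"].fmeasure
        for summ, ref in zip(summaries, references)
    ]


def compare_systems(blogsum_summaries, olist_summaries, references, alpha=0.05):
    """Paired t-test on ROUGE-2 F-scores of two systems over the same documents."""
    blogsum_scores = rouge2_f(blogsum_summaries, references)
    olist_scores = rouge2_f(olist_summaries, references)
    t_stat, p_value = stats.ttest_rel(blogsum_scores, olist_scores)
    # The difference is considered significant when p_value < alpha.
    return t_stat, p_value, p_value < alpha
```

Under this setup, an automatic evaluation that finds no significant difference (p-value above the threshold) can still coexist with manual evaluations that rank one system clearly above the other, which is the discrepancy the paper reports.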