An assessment of the accuracy of automatic evaluation in summarization

  • Authors:
  • Karolina Owczarzak (National Institute of Standards and Technology); John M. Conroy (IDA Center for Computing Sciences); Hoa Trang Dang (National Institute of Standards and Technology); Ani Nenkova (University of Pennsylvania)

  • Venue:
  • Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
  • Year:
  • 2012

Abstract

Automatic evaluation has greatly facilitated system development in summarization. At the same time, the use of automatic evaluation has been viewed with mistrust by many, as its accuracy and correct application are not well understood. In this paper we provide an assessment of the automatic evaluations used for multi-document summarization of news. We outline our recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems. We identify the reference automatic evaluation metrics, ROUGE-1 and ROUGE-2, that appear to best emulate human pyramid and responsiveness scores on four years of NIST evaluations. We then demonstrate the accuracy of these metrics in reproducing human judgments about the relative content quality of pairs of systems, and present an empirical assessment of the relationship between statistically significant differences between systems according to manual evaluations and differences according to automatic evaluations. Finally, we present a case study of how new metrics should be compared to the reference evaluation as we search for even more accurate automatic measures.
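
For readers unfamiliar with the metrics discussed, the sketch below illustrates the two computations the abstract refers to: ROUGE-N recall of a candidate summary against reference summaries, and a paired significance test over per-topic scores for comparing two systems. This is a minimal illustration, not the official ROUGE toolkit or the paper's exact protocol; the function names and the choice of the Wilcoxon signed-rank test are assumptions made for the example.

```python
# Minimal sketch (not the official ROUGE implementation): n-gram recall
# against reference summaries, plus a paired significance test over topics.
# The Wilcoxon signed-rank test is one common choice of paired test and is
# used here only as an illustration.
from collections import Counter
from scipy.stats import wilcoxon


def ngrams(tokens, n):
    """Return a multiset (Counter) of word n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n):
    """Simplified ROUGE-N recall: clipped n-gram overlap / total reference n-grams."""
    cand_counts = ngrams(candidate.lower().split(), n)
    overlap, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        overlap += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0


def compare_systems(scores_a, scores_b):
    """Paired test over the two systems' per-topic scores; returns the p-value."""
    _, p_value = wilcoxon(scores_a, scores_b)
    return p_value


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
    print(rouge_n_recall("the cat sat on a mat", refs, 1))  # unigram (ROUGE-1) recall
    print(rouge_n_recall("the cat sat on a mat", refs, 2))  # bigram (ROUGE-2) recall
```

In a system comparison of the kind the paper evaluates, each system would receive one score per topic (from either a manual or an automatic metric), and the paired test would be applied to those per-topic score vectors rather than to single aggregate scores.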