Non-expert evaluation of summarization systems is risky

  • Authors:
  • Dan Gillick; Yang Liu

  • Affiliations:
  • University of California, Berkeley; University of Texas at Dallas

  • Venue:
  • CSLDAMT '10: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
  • Year:
  • 2010

Abstract

We provide evidence that intrinsic evaluation of summaries using Amazon's Mechanical Turk is quite difficult. Experiments mirroring evaluation at the Text Analysis Conference's summarization track show that non-expert judges are not able to recover system rankings derived from experts.