Lessons learned from large scale evaluation of systems that produce text: nightmares and pleasant surprises

  • Authors: Kathleen R. McKeown
  • Affiliations: Columbia University, New York, NY
  • Venue: INLG '06: Proceedings of the Fourth International Natural Language Generation Conference
  • Year: 2006

Abstract

As the language generation community explores the possibility of an evaluation program for language generation, it behooves us to examine our experience in evaluating other systems that produce text as output. Large scale evaluation of summarization systems and of question answering systems has been carried out for several years now. Summarization and question answering systems produce text output given text as input, while language generation produces text from a semantic representation. Given that the output has the same properties, we can learn from the mistakes made and the insights gained in earlier evaluations. In this invited talk, I will discuss what we have learned in the large scale summarization evaluations carried out in the Document Understanding Conferences (DUC) from 2001 to the present, in the large scale question answering evaluations carried out in TREC (e.g., the definition pilot), and in the new large scale evaluations being carried out in the DARPA GALE (Global Autonomous Language Exploitation) program.