Lessons learned from large scale evaluation of systems that produce text: nightmares and pleasant surprises

  • Authors: Kathleen R. McKeown
  • Affiliations: Columbia University, New York, NY
  • Venue: INLG '06: Proceedings of the Fourth International Natural Language Generation Conference
  • Year: 2006

Abstract

As the language generation community explores the possibility of an evaluation program for language generation, it behooves us to examine our experience in evaluating other systems that produce text as output. Large scale evaluation of summarization systems and of question answering systems has been carried out for several years now. Summarization and question answering systems produce text output given text as input, while language generation produces text from a semantic representation. Given that the output has the same properties, we can learn from the mistakes made and the insights gained in earlier evaluations. In this invited talk, I will discuss what we have learned in the large scale summarization evaluations carried out in the Document Understanding Conferences (DUC) from 2001 to the present, in the large scale question answering evaluations carried out in TREC (e.g., the definition pilot), and in the new large scale evaluations being carried out in the DARPA GALE (Global Autonomous Language Exploitation) program.