An investigation into the validity of some metrics for automatically evaluating natural language generation systems

Authors: Ehud Reiter; Anja Belz
Venue: Computational Linguistics
Year: 2009

Abstract

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.
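
As a rough illustration of what validating a metric against human judgments can involve, the sketch below correlates per-system metric scores with mean human ratings. It is only a minimal sketch: the abstract does not say which correlation coefficient the studies use, so Pearson's r is an assumption here, and the scores, ratings, and the helper name `pearson` are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values, one entry per NLG system (NOT results from the paper).
metric_scores = [0.20, 0.35, 0.50, 0.65]  # e.g. an automatic metric score
human_ratings = [2.0, 3.5, 3.0, 4.5]      # e.g. mean human quality rating

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 would suggest that the metric ranks systems much as human judges do. With only a handful of systems the estimate is unstable, which is one reason to treat any such correlation with caution.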