There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations that have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics that are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, these metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats that must be kept in mind when interpreting this and other validation studies.
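As a rough illustration of what validating a metric against human judgments can look like, the sketch below computes a Pearson correlation between per-system metric scores and per-system human ratings. The abstract does not say which correlation statistic the studies use, so Pearson's r is an assumption here, and the function name `pearson` and the numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: one value per system, e.g. xs holds an automatic metric
    score and ys holds a mean human rating for the same system.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical placeholder values, purely for illustration.
    metric_scores = [0.2, 0.4, 0.6, 0.8]
    human_ratings = [2.0, 3.0, 3.5, 4.5]
    print(pearson(metric_scores, human_ratings))
```

A value near 1 would indicate that the metric ranks systems much as human judges do. With only a handful of systems the estimate is noisy, which is one reason validation studies of this kind need to be read with caution.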