This paper investigates the relationship between the results of an extrinsic, task-based evaluation of an NLG system and various metrics that measure both surface and deep semantic textual properties, including relevance; the deep semantic metrics rely heavily on domain knowledge. We show that these metrics correlate systematically with some measures of task performance. The core argument of the paper is that metrics grounded in domain knowledge shed more light on the relationship between the deep semantic properties of a text and task performance.
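As a minimal sketch of the kind of analysis the abstract describes, the following Python snippet computes the correlation between per-text metric scores and task-performance scores. All names and numbers here are hypothetical and purely illustrative; the paper's actual metrics, data, and statistical procedure are not reproduced.

```python
# Minimal sketch: correlating automatic metric scores with task performance.
# Assumes paired, per-text scores; the data below is purely illustrative.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for the same set of generated texts.
metric_scores = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66]  # e.g. a knowledge-based relevance metric
task_scores   = [0.50, 0.58, 0.72, 0.44, 0.81, 0.69]  # e.g. per-text task success rate

r, r_p = pearsonr(metric_scores, task_scores)        # linear (Pearson) correlation
rho, rho_p = spearmanr(metric_scores, task_scores)   # rank (Spearman) correlation

print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```

A systematic correlation in such an analysis would suggest that the metric tracks the textual properties that actually matter for task performance, which is the relationship the paper examines.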