When evaluating a generation system, if a corpus of target outputs is available, a common and simple strategy is to compare the system output against the corpus contents. However, cross-validation metrics that test whether the system makes exactly the same choices as the corpus on each item have recently been shown not to correlate well with human judgements of quality. An alternative evaluation strategy is to compute intrinsic, task-specific properties of the generated output; this requires more domain-specific metrics, but can often produce a better assessment of the output. In this paper, a range of metrics based on both of these techniques is used to evaluate three methods for selecting the facial displays of an embodied conversational agent, and the predictions of the metrics are compared with human judgements of the same generated output. The corpus-reproduction metrics show no relationship with the human judgements, while the intrinsic metrics that capture the number and variety of facial displays correlate significantly with the preferences of the human users.
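To make the contrast between the two metric families concrete, the following minimal sketch (not the paper's actual code; the display labels, the three strategy names, and the human ratings are all invented for illustration) computes a corpus-reproduction metric (exact-match agreement with the corpus) and an intrinsic metric (the variety of facial displays used), then checks how each metric ranks three hypothetical selection strategies against human preference ratings using Spearman's rho.

```python
# Illustrative sketch only: all data below is hypothetical.
from collections import Counter
from scipy.stats import spearmanr

def corpus_agreement(system_choices, corpus_choices):
    """Corpus-reproduction metric: proportion of items on which the
    system selects exactly the same display as the corpus."""
    matches = sum(s == c for s, c in zip(system_choices, corpus_choices))
    return matches / len(corpus_choices)

def display_variety(system_choices):
    """Intrinsic metric: number of distinct display types used,
    normalised by output length (a type/token ratio)."""
    return len(Counter(system_choices)) / len(system_choices)

# Hypothetical outputs of three selection strategies on the same six
# items, plus invented mean human preference ratings per strategy.
systems = {
    "majority": ["nod", "nod", "nod", "nod", "nod", "nod"],
    "rule":     ["nod", "raise", "nod", "frown", "nod", "raise"],
    "learned":  ["nod", "raise", "frown", "nod", "raise", "nod"],
}
corpus = ["nod", "raise", "frown", "nod", "nod", "raise"]
human_ratings = {"majority": 2.1, "rule": 3.8, "learned": 4.2}

names = list(systems)
agreement = [corpus_agreement(systems[n], corpus) for n in names]
variety = [display_variety(systems[n]) for n in names]
ratings = [human_ratings[n] for n in names]

# Spearman's rho indicates how well each automatic metric ranks the
# systems relative to the human judgements.
print("agreement vs. humans:", spearmanr(agreement, ratings))
print("variety   vs. humans:", spearmanr(variety, ratings))
```

Under this toy data, the repetitive "majority" strategy matches the corpus half the time yet scores lowest with the human judges, while the variety metric ranks the strategies in the same order as the ratings; this mirrors the pattern the abstract reports, though the real study evaluates full generated output rather than toy label sequences.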