When evaluating a generation system, if a corpus of target outputs is available, a common and simple strategy is to compare the system output against the corpus contents. However, cross-validation metrics that test whether the system makes exactly the same choices as the corpus on each item have recently been shown not to correlate well with human judgements of quality. An alternative evaluation strategy is to compute intrinsic, task-specific properties of the generated output; this requires more domain-specific metrics, but can often produce a better assessment of the output. In this paper, a range of metrics based on both of these techniques is used to evaluate three methods for selecting the facial displays of an embodied conversational agent, and the predictions of the metrics are compared with human judgements of the same generated output. The corpus-reproduction metrics show no relationship with the human judgements, while the intrinsic metrics that capture the number and variety of facial displays correlate significantly with the preferences of the human users.
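To make the contrast between the two metric families concrete, the following minimal sketch (not the paper's actual code; the display labels, the three strategy names, and the human ratings are all invented for illustration) computes a corpus-reproduction metric (exact-match agreement with the corpus) and an intrinsic metric (the variety of facial displays used), then checks how each metric ranks three hypothetical selection strategies against human preference ratings using Spearman's rho.

```python
# Illustrative sketch only: all data below is hypothetical.
from collections import Counter
from scipy.stats import spearmanr

def corpus_agreement(system_choices, corpus_choices):
    """Corpus-reproduction metric: proportion of items on which the
    system selects exactly the same display as the corpus."""
    matches = sum(s == c for s, c in zip(system_choices, corpus_choices))
    return matches / len(corpus_choices)

def display_variety(system_choices):
    """Intrinsic metric: number of distinct display types used,
    normalised by output length (a type/token ratio)."""
    return len(Counter(system_choices)) / len(system_choices)

# Hypothetical outputs of three selection strategies on the same six
# items, plus invented mean human preference ratings per strategy.
systems = {
    "majority": ["nod", "nod", "nod", "nod", "nod", "nod"],
    "rule":     ["nod", "raise", "nod", "frown", "nod", "raise"],
    "learned":  ["nod", "raise", "frown", "nod", "raise", "nod"],
}
corpus = ["nod", "raise", "frown", "nod", "nod", "raise"]
human_ratings = {"majority": 2.1, "rule": 3.8, "learned": 4.2}

names = list(systems)
agreement = [corpus_agreement(systems[n], corpus) for n in names]
variety = [display_variety(systems[n]) for n in names]
ratings = [human_ratings[n] for n in names]

# Spearman's rho indicates how well each automatic metric ranks the
# systems relative to the human judgements.
print("agreement vs. humans:", spearmanr(agreement, ratings))
print("variety   vs. humans:", spearmanr(variety, ratings))
```

Under this toy data, the repetitive "majority" strategy matches the corpus half the time yet scores lowest with the human judges, while the variety metric ranks the strategies in the same order as the ratings; this mirrors the pattern the abstract reports, though the real study evaluates full generated output rather than toy label sequences.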