Empirical methods for evaluating dialog systems

Authors: Tim Paek
Affiliations: Microsoft Research, Redmond, WA
Venue: ELDS '01 Proceedings of the workshop on Evaluation for Language and Dialogue Systems - Volume 9
Year: 2001

Abstract

We examine what purpose a dialog metric serves and then propose empirical methods for evaluating systems that meet that purpose. The methods include a protocol for conducting a Wizard-of-Oz experiment and a basic set of descriptive statistics for substantiating performance claims. The data collected from the experiment serve as an ideal benchmark, or "gold standard", for making comparative judgments. The methods also provide a practical means of optimizing the system through component analysis and cost valuation.
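
As a rough illustration of comparing a system against a Wizard-of-Oz gold standard with descriptive statistics, the sketch below summarizes one metric for both conditions and reports the gap between them. The abstract does not name the paper's statistics or metrics, so everything here is an assumption: the metric (turns per task), the data values, and the helper names `describe` and `compare_to_gold` are hypothetical.

```python
from statistics import mean, stdev


def describe(values):
    """Return basic descriptive statistics for one metric."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def compare_to_gold(system, gold):
    """Compare system sessions with wizard (gold standard) sessions."""
    s, g = describe(system), describe(gold)
    gap = s["mean"] - g["mean"]
    return {
        "system": s,
        "gold": g,
        "mean_gap": gap,
        # Gap expressed in units of the gold standard's standard deviation.
        "standardized_gap": gap / g["stdev"],
    }


# Hypothetical data: turns needed to complete a task, one value per session.
wizard_turns = [4, 5, 5, 6]
system_turns = [7, 8, 6, 9]

report = compare_to_gold(system_turns, wizard_turns)
print(report["mean_gap"], round(report["standardized_gap"], 2))
```

A standardized gap is only one possible summary. It is used here because it reads the system's shortfall relative to the variability of the human wizard, which is one way to treat the wizard data as a benchmark.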