An empirical evaluation of a statistical dialog system in public use

Authors:
Jason D. Williams
Affiliations:
AT&T Labs - Research, Shannon Laboratory, Florham Park, NJ
Venue:
SIGDIAL '11 Proceedings of the SIGDIAL 2011 Conference
Year:
2011

Abstract

This paper provides a first assessment of a statistical dialog system in public use. Our dialog system has four main recognition tasks, or slots: bus route names, bus-stop locations, dates, and times. Whereas a conventional system tracks a single value for each slot (i.e., the speech recognizer's top hypothesis), our statistical system tracks a distribution over many possible values for each slot. Past lab studies have shown that this distribution improves robustness to speech recognition errors. To our surprise, however, we found that the distribution increased accuracy for only two of the four slots and actually decreased accuracy for the other two. In this paper, we identify root causes for these differences in performance, including intrinsic properties of N-best lists, parameter settings, and the quality of statistical models. We synthesize our findings into a set of guidelines that aim to assist researchers and practitioners employing statistical techniques in future dialog systems.
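
The contrast the abstract draws between tracking a single top hypothesis and tracking a distribution over slot values can be illustrated with a small sketch. It is not the system evaluated in the paper: the update rule, the route names, the confidence scores, and the function names `top_hypothesis` and `update_belief` are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a toy belief update over one slot (bus route).
# The update rule, candidate values, and scores are hypothetical and are
# not taken from the paper.

def top_hypothesis(nbest):
    """Conventional tracking: keep only the recognizer's best-scoring value."""
    return max(nbest, key=lambda item: item[1])[0]


def update_belief(belief, nbest):
    """Multiply the prior belief by per-value evidence, then renormalize.

    Values on the N-best list use their confidence score as evidence.
    The probability mass not covered by the list is shared equally among
    the candidate values that do not appear on it. Values on the N-best
    list that are not candidates in `belief` are ignored.
    """
    listed = dict(nbest)
    leftover = max(1.0 - sum(listed.values()), 0.0)
    unlisted = [value for value in belief if value not in listed]
    per_unlisted = leftover / len(unlisted) if unlisted else 0.0

    unnormalized = {
        value: prior * listed.get(value, per_unlisted)
        for value, prior in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(belief)  # no usable evidence: keep the prior unchanged
    return {value: weight / total for value, weight in unnormalized.items()}


# Hypothetical candidate routes with a uniform prior.
routes = ["61A", "61B", "61C", "28X"]
belief = {route: 1.0 / len(routes) for route in routes}

# Two hypothetical turns in which the caller says "61A" both times, but the
# recognizer ranks a different route first on each turn.
turns = [
    [("61B", 0.50), ("61A", 0.40)],
    [("61C", 0.45), ("61A", 0.40)],
]

for nbest in turns:
    belief = update_belief(belief, nbest)
    best = max(belief, key=belief.get)
    print("top hypothesis:", top_hypothesis(nbest), "| most likely value:", best)
```

In this toy case the top hypothesis is wrong on both turns, while the distribution settles on the value that appears consistently in second place. The abstract's finding is that this kind of gain is not guaranteed in deployment; it depends on properties of the N-best lists, parameter settings, and model quality.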