Which system differences matter? Using l1/l2 regularization to compare dialogue systems

Authors:
José P. González-Brenes;Jack Mostow
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
SIGDIAL '11 Proceedings of the SIGDIAL 2011 Conference
Year:
2011

References (9):
Towards developing general models of usability with PARADISE. Natural Language Engineering
Experiments in evaluating interactive spoken language systems. HLT '91 Proceedings of the workshop on Speech and Natural Language
Empirical methods for evaluating dialog systems. ELDS '01 Proceedings of the workshop on Evaluation for Language and Dialogue Systems - Volume 9
The PARADISE Evaluation Framework: Issues and Findings. Computational Linguistics
Predicting the quality and usability of spoken dialogue services. Speech Communication
Olympus: an open-source framework for conversational spoken language interface research. NAACL-HLT-Dialog '07 Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies
Automatically training a problematic dialogue predictor for a spoken dialogue system. Journal of Artificial Intelligence Research
Multi-population GWA mapping via multi-task regularized regression. Bioinformatics
Classifying dialogue in high-dimensional space. ACM Transactions on Speech and Language Processing (TSLP)

Cited by (1):
"Love ya, jerkface": using sparse log-linear models to build positive (and impolite) relationships with teens. SIGDIAL '12 Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Abstract

We investigate how to jointly explain the performance and behavioral differences of two spoken dialogue systems. Our approach, Joint Evaluation and Differences Identification (JEDI), finds the differences between systems that are relevant to performance by casting the task as a multi-task feature selection problem. JEDI provides evidence for the usefulness of a recent method, l1/lp-regularized regression (Obozinski et al., 2007). We evaluate against manually annotated success criteria from real users interacting with five different spoken user interfaces that give bus schedule information.
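
The abstract does not spell out the JEDI formulation, so the following is only a minimal sketch of the general technique it names, l1/lp-regularized multi-task regression, for the case p = 2. It treats each system as one task, fits a least-squares model per task, and penalizes the l2 norm of each feature's weights across tasks, so that a feature is either kept for all tasks or dropped for all of them. The function name `fit_l1_l2`, the squared loss, the proximal gradient solver, and the synthetic data are all assumptions made for illustration; none of them come from the paper.

```python
import itertools
import math


def fit_l1_l2(X_tasks, y_tasks, lam, step=0.1, iters=1000):
    """Multi-task least squares with an l1/l2 penalty, via proximal gradient.

    X_tasks[t] holds the feature rows of task t and y_tasks[t] its targets.
    Returns W, where W[j][t] is the weight of feature j in task t.
    (Hypothetical helper written for this sketch, not the paper's code.)
    """
    n_tasks = len(X_tasks)
    n_feats = len(X_tasks[0][0])
    W = [[0.0] * n_tasks for _ in range(n_feats)]
    for _ in range(iters):
        # Gradient of the mean squared error, accumulated per task.
        grad = [[0.0] * n_tasks for _ in range(n_feats)]
        for t in range(n_tasks):
            n = len(X_tasks[t])
            for row, target in zip(X_tasks[t], y_tasks[t]):
                err = sum(W[j][t] * row[j] for j in range(n_feats)) - target
                for j in range(n_feats):
                    grad[j][t] += err * row[j] / n
        # Gradient step, then group soft-thresholding of each feature's
        # weight vector across tasks (the proximal step for the l1/l2 norm).
        for j in range(n_feats):
            v = [W[j][t] - step * grad[j][t] for t in range(n_tasks)]
            norm = math.sqrt(sum(x * x for x in v))
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            W[j] = [scale * x for x in v]
    return W


# Synthetic, noise-free toy data (an assumption, not data from the paper):
# three +/-1 features, of which only feature 0 drives the outcome, with a
# different effect in each of the two tasks.
rows = [list(r) for r in itertools.product([-1.0, 1.0], repeat=3)]
X_tasks = [rows, rows]
y_tasks = [[2.0 * r[0] for r in rows], [-1.0 * r[0] for r in rows]]

W = fit_l1_l2(X_tasks, y_tasks, lam=0.5)
selected = [j for j, w_row in enumerate(W) if any(w != 0.0 for w in w_row)]
print(selected)  # [0]
```

In this toy setting only feature 0 survives the penalty, and its two weights keep opposite signs, which is the kind of outcome a joint analysis would read as a feature that matters in both systems but in different ways. The penalty strength `lam` controls how many features are kept and would need to be tuned on real data.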