Multimodal error correction for speech user interfaces

  • Authors:
  • Bernhard Suhm (BBN Technologies, Cambridge, MA)
  • Brad Myers (Carnegie Mellon Univ., Pittsburgh, PA)
  • Alex Waibel (Carnegie Mellon Univ., Pittsburgh, PA, and Karlsruhe Univ., Germany)

  • Venue:
  • ACM Transactions on Computer-Human Interaction (TOCHI)
  • Year:
  • 2001

Abstract

Although commercial dictation systems and speech-enabled telephone voice user interfaces have become readily available, speech recognition errors remain a serious problem in the design and implementation of speech user interfaces. Previous work hypothesized that switching modality could speed up interactive correction of recognition errors. This article presents multimodal error correction methods that allow the user to correct recognition errors efficiently without keyboard input. Correction accuracy is maximized by novel recognition algorithms that use context information for recognizing correction input. Multimodal error correction is evaluated in the context of a prototype multimodal dictation system. The study shows that unimodal repair is less accurate than multimodal error correction. On a dictation task, multimodal correction is faster than unimodal correction by respeaking. The study also provides empirical evidence that system-initiated error correction (based on confidence measures) may not expedite error correction. Furthermore, the study suggests that recognition accuracy determines user choice between modalities: while users initially prefer speech, they learn to avoid ineffective correction modalities with experience. To extrapolate results from this user study, the article introduces a performance model of (recognition-based) multimodal interaction that predicts input speed including time needed for error correction. Applied to interactive error correction, the model predicts the impact of improvements in recognition technology on correction speeds, and the influence of recognition accuracy and correction method on the productivity of dictation systems. This model is a first step toward formalizing multimodal interaction.
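The abstract's closing claim, that a performance model can predict input speed once error-correction time is folded in, can be illustrated with a minimal throughput sketch. The formulation below is an assumption for illustration only, not the authors' actual model: it treats effective input speed as raw entry speed discounted by the expected per-word correction cost.

```python
# Illustrative sketch of a throughput model for recognition-based input.
# The formula is an assumption, not the article's published equations:
# each word costs its raw entry time plus, with probability equal to the
# word error rate, the average time to correct one misrecognized word.

def effective_wpm(entry_wpm: float, word_error_rate: float,
                  correction_secs_per_error: float) -> float:
    """Effective words per minute once correction time is included.

    entry_wpm: raw input speed, ignoring recognition errors
    word_error_rate: fraction of words misrecognized (0..1)
    correction_secs_per_error: mean time to repair one misrecognized word
    """
    secs_per_word = 60.0 / entry_wpm
    total_secs_per_word = secs_per_word + word_error_rate * correction_secs_per_error
    return 60.0 / total_secs_per_word

# Hypothetical numbers: dictation at 100 wpm raw with a 10% word error
# rate, comparing a 5 s multimodal repair against a 10 s respeak repair.
multimodal = effective_wpm(100.0, 0.10, 5.0)   # ~54.5 wpm
respeak = effective_wpm(100.0, 0.10, 10.0)     # 37.5 wpm
```

Such a model makes the abstract's point concrete: halving the per-error correction time raises effective throughput substantially even when raw recognition speed is unchanged, which is why correction method and recognition accuracy jointly determine dictation productivity.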