We present a system that augments any unmodified Java application with an adaptive speech interface. The augmented system learns to associate spoken words and utterances with interface actions such as button clicks. Speech learning runs continuously, searching for correlations between what the user says and what the user does, so training the interface is seamlessly integrated with using it. As the user performs normal actions, she can optionally describe aloud what she is doing. Because the system uses a phoneme recognizer, it can learn new speech commands quickly. Speech commands are chosen by the user and can be recognized robustly thanks to accurate phonetic modelling of the user's utterances and the small vocabulary learned for a single application. After only a few examples, speech commands can replace mouse clicks; in effect, selected interface functions migrate from keyboard and mouse to speech. We demonstrate the usefulness of this approach by augmenting jfig, a drawing application, where speech commands save the user from the distraction of having to use a tool palette.
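The learning loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the class names, the ARPAbet-style phoneme tokens, and the relative edit-distance threshold are all assumptions. The idea is that each narrated action stores a (phoneme sequence, action) pair, and later utterances are matched to the nearest stored example by phonetic edit distance, which is tractable precisely because the per-application vocabulary is small.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[m][n]


class SpeechActionLearner:
    """Illustrative learner: accumulates (phoneme sequence, action) examples
    as the user narrates normal GUI actions, then matches later utterances
    to the phonetically closest stored example."""

    def __init__(self, max_relative_distance=0.4):
        # Threshold is an assumed tuning parameter, not from the paper.
        self.examples = []
        self.max_relative_distance = max_relative_distance

    def observe(self, phonemes, action):
        """Record one narrated example: the user said `phonemes` while
        performing `action` (e.g. clicking a palette button)."""
        self.examples.append((list(phonemes), action))

    def recognize(self, phonemes):
        """Return the action whose stored utterance is phonetically closest,
        or None if nothing is close enough."""
        phonemes = list(phonemes)
        best_action, best_score = None, float("inf")
        for example, action in self.examples:
            score = edit_distance(phonemes, example) / max(len(example), 1)
            if score < best_score:
                best_action, best_score = action, score
        return best_action if best_score <= self.max_relative_distance else None
```

For example, after observing the user say "delete" (phonemes D EH L IY T) while clicking the delete button, a slightly misrecognized later utterance such as D EH L EY T still maps to the same action, while an unrelated utterance maps to nothing.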