Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language like their human counterparts in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals together with multisensory information from non-speech modalities, such as the user's perspective video, gaze positions, head directions, and hand movements. The system first estimates the user's focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used to spot the target object of user interest. Attention switches are detected and used to segment an action sequence into action units, which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words from continuous speech and associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar", "stapling a letter", and "pouring water".
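To make the attention-based segmentation step concrete, the sketch below shows one way the pipeline could be organized: per-frame gaze targets (derived from combined eye and head cues) are collapsed into fixations, and each switch of the fixated object starts a new action unit, which would then be handed to the mixture-HMM categorizer. This is a minimal illustration, not the authors' implementation; the function names, the per-frame object labels, and the minimum-fixation threshold are assumptions introduced here for clarity.

```python
# Hypothetical sketch of attention-switch segmentation (not the paper's code).
# Assumes each video frame has already been labeled with the object closest to
# the estimated gaze point; labels, frame rate, and thresholds are illustrative.

from itertools import groupby


def estimate_attention(gaze_targets, min_fixation_frames=5):
    """Collapse per-frame gaze targets into fixations, dropping short glances.

    gaze_targets: list of object labels, one per frame.
    Returns a list of (object_label, start_frame, end_frame) fixations.
    """
    fixations = []
    frame = 0
    for label, run in groupby(gaze_targets):
        length = len(list(run))
        if length >= min_fixation_frames:
            fixations.append((label, frame, frame + length - 1))
        frame += length
    return fixations


def segment_action_units(fixations):
    """Cut the stream at attention switches: each change of fixated object
    starts a new action unit (a candidate segment for HMM categorization)."""
    units = []
    for i, (label, start, end) in enumerate(fixations):
        if i == 0 or label != fixations[i - 1][0]:
            units.append({"object": label, "start": start, "end": end})
        else:
            units[-1]["end"] = end  # merge consecutive fixations on same object
    return units


if __name__ == "__main__":
    # Toy per-frame gaze stream from a "stapling a letter"-style task.
    gaze = ["paper"] * 30 + ["stapler"] * 3 + ["stapler"] * 40 + ["paper"] * 25
    for unit in segment_action_units(estimate_attention(gaze)):
        print(unit)
```

In this toy example, the brief three-frame glance is absorbed into the surrounding fixation, so the output is three action units (paper, stapler, paper), mirroring how a short saccade would not by itself trigger an attention switch.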