Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language like their human counterparts in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals together with multisensory information from non-speech modalities, such as the user's perspective video, gaze positions, head directions, and hand movements. The system first estimates the user's focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used to spot the target object of user interest. Attention switches are detected and used to segment an action sequence into action units, which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words from continuous speech and associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar", "stapling a letter", and "pouring water".
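To make the attention-based segmentation step concrete, the sketch below shows one way the pipeline could be organized: per-frame gaze targets (derived from combined eye and head cues) are collapsed into fixations, and each switch of the fixated object starts a new action unit, which would then be handed to the mixture-HMM categorizer. This is a minimal illustration, not the authors' implementation; the function names, the per-frame object labels, and the minimum-fixation threshold are assumptions introduced here for clarity.

```python
# Hypothetical sketch of attention-switch segmentation (not the paper's code).
# Assumes each video frame has already been labeled with the object closest to
# the estimated gaze point; labels, frame rate, and thresholds are illustrative.

from itertools import groupby


def estimate_attention(gaze_targets, min_fixation_frames=5):
    """Collapse per-frame gaze targets into fixations, dropping short glances.

    gaze_targets: list of object labels, one per frame.
    Returns a list of (object_label, start_frame, end_frame) fixations.
    """
    fixations = []
    frame = 0
    for label, run in groupby(gaze_targets):
        length = len(list(run))
        if length >= min_fixation_frames:
            fixations.append((label, frame, frame + length - 1))
        frame += length
    return fixations


def segment_action_units(fixations):
    """Cut the stream at attention switches: each change of fixated object
    starts a new action unit (a candidate segment for HMM categorization)."""
    units = []
    for i, (label, start, end) in enumerate(fixations):
        if i == 0 or label != fixations[i - 1][0]:
            units.append({"object": label, "start": start, "end": end})
        else:
            units[-1]["end"] = end  # merge consecutive fixations on same object
    return units


if __name__ == "__main__":
    # Toy per-frame gaze stream from a "stapling a letter"-style task.
    gaze = ["paper"] * 30 + ["stapler"] * 3 + ["stapler"] * 40 + ["paper"] * 25
    for unit in segment_action_units(estimate_attention(gaze)):
        print(unit)
```

In this toy example, the brief three-frame glance is absorbed into the surrounding fixation, so the output is three action units (paper, stapler, paper), mirroring how a short saccade would not by itself trigger an attention switch.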