We present a multimodal interface that learns words from natural interactions with users. Motivated by studies of human language development, the learning system is trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. The system collects acoustic signals in concert with user-centric multisensory information from nonspeech modalities, such as first-person video from the user's perspective, gaze positions, head directions, and hand movements. A multimodal learning algorithm uses this data to first spot words in continuous speech and then associate action verbs and object names with their perceptually grounded meanings. The central ideas are to use nonspeech contextual information to facilitate word spotting, and to utilize body movements as deictic references that associate temporally co-occurring data from different modalities and build hypothesized lexical items. From those items, an EM-based method is developed to select correct word-meaning pairs. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar," "stapling a letter," and "pouring water."
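The EM-based selection of word-meaning pairs can be illustrated with a minimal sketch in the style of IBM Model 1 translation estimation: each trial pairs the words spotted in an utterance with the perceptual symbols co-occurring in the nonspeech channels, and EM iteratively reweights fractional alignments until each meaning concentrates its probability mass on one word. The trial data, symbol names, and uniform initialization below are illustrative assumptions, not the paper's actual corpus or parameterization.

```python
from collections import defaultdict

# Toy parallel data: spotted words paired with co-occurring perceptual
# symbols (hypothetical examples, not the paper's actual data).
trials = [
    (["unscrew", "jar"], ["UNSCREW", "JAR"]),
    (["pour", "water"], ["POUR", "WATER"]),
    (["unscrew", "lid"], ["UNSCREW", "LID"]),
    (["pour", "jar"], ["POUR", "JAR"]),
]

words = {w for ws, _ in trials for w in ws}
meanings = {m for _, ms in trials for m in ms}

# Uniform initialization of p(word | meaning).
p = {(w, m): 1.0 / len(words) for w in words for m in meanings}

for _ in range(30):                        # EM iterations
    count = defaultdict(float)             # expected co-occurrence counts
    total = defaultdict(float)
    for ws, ms in trials:
        for w in ws:
            norm = sum(p[(w, m)] for m in ms)
            for m in ms:                   # E-step: fractional alignments
                c = p[(w, m)] / norm
                count[(w, m)] += c
                total[m] += c
    for w in words:                        # M-step: renormalize counts
        for m in meanings:
            p[(w, m)] = count[(w, m)] / total[m] if total[m] else 0.0

# Select the most probable word for each grounded meaning.
best = {m: max(words, key=lambda w: p[(w, m)]) for m in meanings}
```

Because "jar" appears in two different contexts while "unscrew" and "pour" each recur with their own actions, the fractional counts disambiguate the pairings even though no single trial is unambiguous on its own.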