Towards Visually-Grounded Spoken Language Acquisition
ICMI '02 Proceedings of the 4th IEEE International Conference on Multimodal Interfaces
Language is grounded in sensory-motor experience. Grounding connects concepts to the physical world, enabling humans to acquire and use words and sentences in context. Currently, most machines that process language are not grounded: their semantic representations are abstract, pre-specified, and meaningful only when interpreted by humans. We are interested in developing computational systems that represent words, utterances, and underlying concepts in terms of sensory-motor experiences, leading to richer levels of machine understanding. A key element of this work is the development of effective architectures for processing multisensory data. Inspired by theories of infant cognition, we present a computational model that learns words from untranscribed acoustic and video input. Channels of input derived from different sensors are integrated in an information-theoretic framework. Acquired words are represented as associations between acoustic and visual sensory experience. The model has been implemented in a real-time robotic system that performs interactive language learning and understanding. Successful learning has also been demonstrated using infant-directed speech and images.
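To make the information-theoretic integration concrete, here is a minimal sketch of one common way to score cross-channel associations: computing the mutual information between the occurrence of an acoustic token and the co-occurrence of a visual category across observation episodes, and keeping word–category pairs that score highly. The episode data and the specific binary-event formulation are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two binary event streams,
    given as a list of (word_present, category_present) pairs,
    one pair per observation episode."""
    n = len(pairs)
    joint = Counter(pairs)                 # joint counts over (x, y)
    px = Counter(x for x, _ in pairs)      # marginal counts for the word channel
    py = Counter(y for _, y in pairs)      # marginal counts for the visual channel
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) )
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical co-occurrence episodes: did the acoustic token "ball"
# occur, and was a round object in view? (invented data for illustration)
episodes = [(1, 1), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (0, 0), (1, 1)]
print(round(mutual_information(episodes), 3))  # → 0.549
```

A high score indicates that the acoustic and visual events are statistically coupled, so the token is a good candidate label for the category; a score near zero indicates the channels are independent and the pairing is discarded.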