Integration of speech and vision using mutual information

  • Authors:
  • D. Roy

  • Affiliations:
  • Media Lab., MIT, Cambridge, MA, USA

  • Venue:
  • ICASSP '00: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 04
  • Year:
  • 2000

Abstract

We are developing a system which learns words from co-occurring spoken and visual input. The goal is to automatically segment continuous speech at word boundaries without a lexicon, and to form visual categories which correspond to spoken words. Mutual information is used to integrate acoustic and visual distance metrics in order to extract an audio-visual lexicon from raw input. We report results of experiments with a corpus of infant-directed speech and images.
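The abstract only sketches how mutual information ties the two modalities together. As a minimal illustrative sketch (not the paper's implementation), mutual information between two binary events, an acoustic word candidate matching a speech segment and a visual category matching the co-occurring image, can be estimated from co-occurrence counts; candidate pairs with high mutual information would be kept as audio-visual lexical entries. The data and threshold below are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(A; V) in bits from (a, v) observation pairs.

    Each pair records, for one utterance/image co-occurrence, whether an
    acoustic word candidate matched the speech (a in {0, 1}) and whether a
    visual category matched the image (v in {0, 1}).
    """
    n = len(pairs)
    joint = Counter(pairs)              # counts of (a, v) outcomes
    pa = Counter(a for a, _ in pairs)   # marginal counts of a
    pv = Counter(v for _, v in pairs)   # marginal counts of v
    mi = 0.0
    for (a, v), c in joint.items():
        # p(a,v) * log2( p(a,v) / (p(a) p(v)) ), with counts normalized by n
        mi += (c / n) * math.log2(c * n / (pa[a] * pv[v]))
    return mi

# Hypothetical example: 100 speech/image observations in which the acoustic
# candidate "ball" and a ball-like visual prototype tend to co-occur.
observations = [(1, 1)] * 30 + [(0, 0)] * 55 + [(1, 0)] * 8 + [(0, 1)] * 7
score = mutual_information(observations)
print(f"I(A; V) = {score:.3f} bits")
if score > 0.2:  # hypothetical selection threshold
    print("Keep this audio-visual pair as a lexical candidate.")
```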