Visual focus of attention in adaptive language acquisition

Authors:
Ananth Sankar;Allen Gorin
Affiliations:
AT&T Bell Laboratories, Murray Hill, NJ;AT&T Bell Laboratories, Murray Hill, NJ
Venue:
ICASSP'93 Proceedings of the 1993 IEEE international conference on Acoustics, speech, and signal processing: plenary, special, audio, underwater acoustics, VLSI, neural networks - Volume I
Year:
1993

Abstract

In our research on Adaptive Language Acquisition, we have been investigating connectionist systems that learn the mapping from a message to a meaningful machine action through interaction with a complex environment. Previously, the only input to these systems has been the message. However, in many devices of interest, the action also depends on the state of the world, thereby motivating the study of systems with multisensory input. In this work, we describe and evaluate a device that acquires language through interaction with an environment that provides both keyboard and visual input. In particular, the machine's action is to focus its attention by directing its eyeball toward one of many blocks of different colors and shapes in response to a message such as "Look at the red square". The attention focus is controlled by minimizing a time-varying potential function that correlates the message and visual input. This correlation is factored through color and shape sensory primitive subnetworks in an information-theoretic connectionist network, allowing the machine to generalize between different objects having the same color or shape. The system runs in a conversational mode where the user can provide clarifying messages and error feedback until the system responds correctly. During the course of performing its task, a vocabulary of 431 words was acquired from 11 users in over 1000 unconstrained natural language conversations. On average, the machine needed only 1.4 input sentences to respond correctly, and it retained 98% of what it was taught.
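
The idea of factoring the message-to-vision correlation through separate color and shape primitives can be illustrated with a toy sketch. It is not the authors' network. The weight tables are hand-set assumptions standing in for learned associations, the names `COLOR_WEIGHTS`, `SHAPE_WEIGHTS`, `potential`, and `focus` are hypothetical, and the potential is minimized over a fixed list of blocks rather than over a time-varying eye position.

```python
# Toy illustration only: weights are hand-set assumptions, not learned values,
# and the potential is a simple stand-in for the one described in the abstract.

# Hypothetical word -> color-primitive associations.
COLOR_WEIGHTS = {
    "red": {"red": 1.0},
    "blue": {"blue": 1.0},
}

# Hypothetical word -> shape-primitive associations.
SHAPE_WEIGHTS = {
    "square": {"square": 1.0},
    "circle": {"circle": 1.0},
}


def activations(message, weights):
    """Sum word-to-primitive weights over the words in the message."""
    totals = {}
    for word in message.lower().split():
        for primitive, w in weights.get(word, {}).items():
            totals[primitive] = totals.get(primitive, 0.0) + w
    return totals


def potential(message, block):
    """Lower potential means a better match between message and block.

    The score is the sum of a color term and a shape term, so a word
    associated with "red" helps for every red object, whatever its shape.
    """
    color_act = activations(message, COLOR_WEIGHTS)
    shape_act = activations(message, SHAPE_WEIGHTS)
    score = color_act.get(block["color"], 0.0) + shape_act.get(block["shape"], 0.0)
    return -score


def focus(message, blocks):
    """Return the block that minimizes the potential for this message."""
    return min(blocks, key=lambda b: potential(message, b))


blocks = [
    {"color": "red", "shape": "circle", "pos": (0, 0)},
    {"color": "blue", "shape": "square", "pos": (1, 0)},
    {"color": "red", "shape": "square", "pos": (2, 0)},
]

target = focus("Look at the red square", blocks)
print(target["pos"])  # (2, 0)
```

Because the color and shape terms are scored independently, the sketch picks the red square even though no table entry mentions that combination, which is the kind of generalization the abstract attributes to the factored design.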