Embodied Active Vision in Language Learning and Grounding

  • Authors: Chen Yu
  • Affiliations: Indiana University, Bloomington, IN 47401, USA
  • Venue: Attention in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint
  • Year: 2008

Abstract

Most cognitive studies of language acquisition in both natural and artificial systems have focused on the role of purely linguistic information as the central constraint. However, we argue that non-linguistic information, such as vision and talkers' attention, also plays a major role in language acquisition. To support this argument, this chapter reports two studies of embodied language learning, one on natural intelligence and one on artificial intelligence. First, we developed a novel method that describes the visual learning environment from a young child's point of view: a multi-camera sensing environment built around two head-mounted mini cameras, one placed on the child's forehead and one on the parent's. The major result is that the child uses his or her body to constrain the visual information he or she perceives, and in doing so arrives at an embodied solution to the reference uncertainty problem in language learning. In our second study, we developed a learning system trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. The system collects acoustic signals together with user-centric multisensory information from non-speech modalities, such as first-person video, gaze positions, head direction, and hand movements. A multimodal learning algorithm uses these data first to spot words in continuous speech and then to associate action verbs and object names with their perceptually grounded meanings. Like human learners, the system exploits non-speech contextual information to facilitate word spotting, and uses body movements as deictic references to associate temporally co-occurring data from different modalities and build a visually grounded lexicon.
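
The abstract does not spell out the association algorithm, so the following is a minimal, hypothetical Python sketch of the temporal co-occurrence idea it describes: spotted words are paired with whatever object or action the user's body (gaze, head, hands) was attending to around the time each word was uttered, and the most reliable pairing is kept as the word's grounded meaning. The function name, the one-second window, and the toy data are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def associate_words_with_meanings(utterances, attention_log, window=1.0):
    """Hypothetical sketch: count how often each spotted word co-occurs
    with an attended object/action within a time window, then keep the
    highest-scoring pairing as that word's grounded meaning."""
    cooc = defaultdict(lambda: defaultdict(int))  # word -> meaning -> count
    word_freq = defaultdict(int)

    for word, t_word in utterances:               # (word, utterance time)
        word_freq[word] += 1
        for meaning, t_start, t_end in attention_log:
            # Body movement acts as a deictic pointer: a word is paired
            # with whatever was attended around the moment it was spoken.
            if t_start - window <= t_word <= t_end + window:
                cooc[word][meaning] += 1

    lexicon = {}
    for word, meanings in cooc.items():
        meaning, count = max(meanings.items(), key=lambda kv: kv[1])
        # Normalize by word frequency so reliable pairings beat merely
        # frequent words.
        lexicon[word] = (meaning, count / word_freq[word])
    return lexicon

# Toy run: "stapler" is uttered twice while a stapler is attended,
# "fold" once during a folding action.
utterances = [("stapler", 2.1), ("fold", 5.0), ("stapler", 9.3)]
attention_log = [("STAPLER", 1.5, 3.0), ("FOLDING", 4.5, 6.0),
                 ("STAPLER", 9.0, 10.0)]
print(associate_words_with_meanings(utterances, attention_log))
# {'stapler': ('STAPLER', 1.0), 'fold': ('FOLDING', 1.0)}
```

In a real system, raw co-occurrence counting would be replaced by a statistical model robust to noisy word spotting and competing referents, but the deictic-reference principle is the same: temporal alignment between speech and embodied attention supplies the supervision signal.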