A multimodal learning interface for grounding spoken language in sensory perceptions

  • Authors:
  • Chen Yu; Dana H. Ballard

  • Affiliations:
  • University of Rochester, Rochester, NY; University of Rochester, Rochester, NY

  • Venue:
  • ACM Transactions on Applied Perception (TAP)
  • Year:
  • 2004

Abstract

We present a multimodal interface that learns words from natural interactions with users. In light of studies of human language development, the learning system is trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. The system collects acoustic signals in concert with user-centric multisensory information from nonspeech modalities, such as the user's perspective video, gaze positions, head directions, and hand movements. A multimodal learning algorithm uses these data first to spot words in continuous speech and then to associate action verbs and object names with their perceptually grounded meanings. The central ideas are to use nonspeech contextual information to facilitate word spotting, and to treat body movements as deictic references that associate temporally co-occurring data from different modalities into hypothesized lexical items. From those items, an EM-based method selects the correct word-meaning pairs. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar," "stapling a letter," and "pouring water."
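
The EM-based selection of word-meaning pairs mentioned above can be illustrated with a small sketch. This is a simplification in the style of statistical translation lexicon learning, not the paper's exact algorithm: it assumes each utterance (a list of spotted words) is paired with the set of grounded meanings observed in the same episode, and EM estimates association probabilities p(meaning | word). The function name em_word_meaning and the toy data are hypothetical.

```python
# A minimal sketch (not the paper's exact algorithm): EM estimation of
# association probabilities p(meaning | word) from utterances paired with
# the grounded meanings co-occurring in the same observation window.
from collections import defaultdict

def em_word_meaning(pairs, iterations=20):
    """pairs: list of (words, meanings); both are lists of symbols
    observed together in one episode."""
    meanings = {m for _, ms in pairs for m in ms}
    # Uniform initialization of p(m | w) over all meanings.
    prob = defaultdict(lambda: 1.0 / len(meanings))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts c(w, m)
        total = defaultdict(float)   # per-word normalizers
        for ws, ms in pairs:
            for w in ws:
                # E-step: split w's responsibility over the meanings present.
                norm = sum(prob[(w, m)] for m in ms)
                for m in ms:
                    r = prob[(w, m)] / norm
                    count[(w, m)] += r
                    total[w] += r
        # M-step: re-estimate p(m | w) from the expected counts.
        for (w, m), c in count.items():
            prob[(w, m)] = c / total[w]
    return prob

# Toy usage: "jar" should predict the JAR meaning more strongly than "the",
# which also co-occurs with the stapling episode's meanings.
data = [
    (["unscrew", "the", "jar"], ["UNSCREW", "JAR"]),
    (["pour", "water", "into", "the", "jar"], ["POUR", "WATER", "JAR"]),
    (["staple", "the", "letter"], ["STAPLE", "LETTER"]),
]
probs = em_word_meaning(data)
print(round(probs[("jar", "JAR")], 2), round(probs[("the", "JAR")], 2))
```

In this simplified setting the deictic, body-movement cues described in the abstract are abstracted away into the meaning sets themselves; the sketch only shows how expectation-maximization concentrates probability mass on consistent word-meaning pairings across episodes.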