Towards adaptive object recognition for situated human-computer interaction

  • Authors:
  • Kate Saenko;Trevor Darrell

  • Affiliations:
  • Massachusetts Institute of Technology, Cambridge, MA;Massachusetts Institute of Technology, Cambridge, MA

  • Venue:
  • Proceedings of the 2007 workshop on Multimodal interfaces in semantic interaction
  • Year:
  • 2007

Abstract

Object recognition is an important part of human-computer interaction in situated environments, such as a home or an office. Especially useful is category-level recognition (e.g., recognizing the class of chairs, as opposed to a particular chair). While humans can employ multimodal cues for categorizing objects during situated conversational interactions, most computer algorithms currently rely on vision-only or speech-only recognition. We are developing a method for learning about physical objects found in a situated environment based on visual and spoken input provided by the user. The algorithm operates on generic databases of labeled object images and transcribed speech data, plus unlabeled audio and images of a user referring to objects in the environment. By exploiting the generic labeled databases, the algorithm would associate probable object-referring words with probable visual representations of those objects, and use both modalities to determine the object label. The first advantage of this approach over vision-only or speech-only recognition is the ability to disambiguate object categories using complementary information sources. The second advantage is that, by using the additional unlabeled data gathered during the interaction, the system can potentially improve its recognition of new category instances in the physical environment in which it is situated, as well as of new utterances spoken by the same user, compared to a system that uses only the generic labeled databases. It can achieve this by adapting its generic object classifiers and its generic speech and language models.
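The abstract does not give an algorithmic recipe, so the following is only a minimal sketch of the fusion idea it describes: a vision recognizer and a speech recognizer each produce per-category posteriors, and a weighted log-linear combination uses the two complementary sources to disambiguate the object label. The category list, the `alpha` weight, and both posterior vectors below are hypothetical placeholders for illustration, not values or interfaces from the paper.

```python
import numpy as np

# Hypothetical object categories for a situated environment.
CATEGORIES = ["chair", "table", "mug"]

def fuse_posteriors(p_vision, p_speech, alpha=0.5):
    """Combine vision and speech class posteriors with a weighted
    log-linear mixture; alpha trades off the two modalities."""
    log_p = alpha * np.log(p_vision) + (1 - alpha) * np.log(p_speech)
    p = np.exp(log_p - log_p.max())  # subtract max for numerical stability
    return p / p.sum()               # renormalize to a distribution

# Example: vision alone is unsure between chair and table, but the
# user's utterance strongly suggests "chair", so fusion disambiguates.
p_vision = np.array([0.45, 0.40, 0.15])  # from a generic image classifier
p_speech = np.array([0.80, 0.10, 0.10])  # from a generic speech recognizer
p_fused = fuse_posteriors(p_vision, p_speech)
print(CATEGORIES[int(np.argmax(p_fused))])  # -> "chair"
```

For the adaptation step the abstract mentions, one plausible (but here assumed, not stated) use of such fused scores is as pseudo-labels: confidently fused predictions on the unlabeled in-situ audio and images could be fed back to retrain the generic object classifiers and speech/language models for the particular environment and user.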