Tutor-based learning of visual categories using different levels of supervision

  • Authors:
  • Mario Fritz;Geert-Jan M. Kruijff;Bernt Schiele

  • Affiliations:
  • EECS Department, UC Berkeley & ICSI, Berkeley, USA;Language Technology Lab, DFKI GmbH, Saarbrücken, Germany;CS Department, TU-Darmstadt & MPI Informatik, Saarbrücken, Germany

  • Venue:
  • Computer Vision and Image Understanding
  • Year:
  • 2010


Abstract

In recent years we have seen a great deal of strong work in visual recognition, dialogue interpretation and multi-modal learning aimed at providing the building blocks that enable intelligent robots to interact with humans in a meaningful way and even to evolve continuously during this process. Building systems that unify those components under a common architecture has turned out to be challenging, as each approach comes with its own set of assumptions, restrictions, and implications. For example, the impact of recent progress in visual category recognition has been limited from the perspective of interactive systems. The reasons for this are diverse. We identify and address two major challenges to integrating modern techniques for visual categorization into an interactive learning system: reducing the number of required labeled training examples and dealing with potentially erroneous input. Today's object categorization methods use either supervised or unsupervised training. While supervised methods tend to produce more accurate results, unsupervised methods are highly attractive because they can exploit far larger amounts of unlabeled training data. We propose a novel method that uses unsupervised training to obtain visual groupings of objects and a cross-modal learning scheme to overcome the inherent limitations of purely unsupervised training. The method uses a unified, scale-invariant object representation that handles labeled as well as unlabeled information in a coherent way. First experiments demonstrate the ability of the system to learn object category models from many unlabeled observations and a few dialogue interactions that can be ambiguous or even erroneous.
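The general idea of combining unsupervised grouping with a few, possibly erroneous, tutor labels can be sketched as follows. This is an illustrative toy, not the paper's actual method or representation: it clusters unlabeled feature vectors with a simple k-means and then propagates a handful of tutor labels to whole clusters by majority vote, so that an occasional wrong label is outvoted. The function names (`cluster_kmeans`, `label_clusters`) and the use of k-means are assumptions for illustration only.

```python
import numpy as np
from collections import Counter

def cluster_kmeans(X, k, iters=20, seed=0):
    """Unsupervised grouping of feature vectors X (n x d) into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned samples
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def label_clusters(assign, labeled_idx, labels):
    """Propagate a few (possibly noisy) tutor labels to whole clusters.

    Majority voting within each cluster tolerates occasional
    erroneous or ambiguous tutor input.
    """
    cluster_label = {}
    for c in set(assign[labeled_idx]):
        votes = [lbl for i, lbl in zip(labeled_idx, labels) if assign[i] == c]
        cluster_label[c] = Counter(votes).most_common(1)[0][0]
    return cluster_label
```

With many unlabeled observations forming well-separated visual groupings, a small number of dialogue-derived labels per group suffices to name every member, and a single mislabeled example does not flip the cluster's category.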