Using Language to Drive the Perceptual Grouping of Local Image Features

  • Authors:
  • Michael Jamieson;Sven Dickinson;Suzanne Stevenson;Sven Wachsmuth

  • Affiliations:
  • University of Toronto;University of Toronto;University of Toronto;Bielefeld University

  • Venue:
  • CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

We address the problem of learning both the semantics (names) and the visual features (SIFT collections) of objects appearing in a training set of unstructured, captioned images of cluttered scenes. Prior work in applying machine translation models to learn the associations between image features and caption nouns has assumed a one-toone correspondence between features and nouns. However, each training image may contain thousands of SIFT features belonging to multiple objects. Our challenge is two-fold: 1) grouping the SIFT features into meaningful collections, and 2) learning the object names associated with those collections. Since better collections tend to have stronger associations with object names, we offer an integrated solution that uses the caption words to drive the feature grouping process. The result is a more general model acquisition framework that does not assume words correspond to individual features and does not require training images with isolated objects or unambiguous labels. The model that is learned performs well at labeling cluttered scenes in a set of test images.