Using Language to Drive the Perceptual Grouping of Local Image Features

Authors:
Michael Jamieson;Sven Dickinson;Suzanne Stevenson;Sven Wachsmuth
Affiliations:
University of Toronto;University of Toronto;University of Toronto;Bielefeld University
Venue:
CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
Year:
2006

Citing: 0
Cited by: 4

Learning Visual Compound Models from Parallel Image-Text Datasets
Proceedings of the 30th DAGM symposium on Pattern Recognition

Contextual image annotation via projection and quantum theory inspired measurement for integration of text and visual features
QI'11 Proceedings of the 5th international conference on Quantum interaction

Context-Aware Semi-Local Feature Detector
ACM Transactions on Intelligent Systems and Technology (TIST)

Discovering hierarchical object models from captioned images
Computer Vision and Image Understanding

Abstract

We address the problem of learning both the semantics (names) and the visual features (SIFT collections) of objects appearing in a training set of unstructured, captioned images of cluttered scenes. Prior work in applying machine translation models to learn the associations between image features and caption nouns has assumed a one-to-one correspondence between features and nouns. However, each training image may contain thousands of SIFT features belonging to multiple objects. Our challenge is two-fold: 1) grouping the SIFT features into meaningful collections, and 2) learning the object names associated with those collections. Since better collections tend to have stronger associations with object names, we offer an integrated solution that uses the caption words to drive the feature grouping process. The result is a more general model acquisition framework that does not assume words correspond to individual features and does not require training images with isolated objects or unambiguous labels. The model that is learned performs well at labeling cluttered scenes in a set of test images.
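
The idea that a good feature collection should associate strongly with an object name can be illustrated with a small sketch. The abstract does not state which correspondence measure the authors use, so the code below substitutes pointwise mutual information (PMI) as a stand-in. The function name, the group identifiers ("g1", "g2"), and the toy captions are all hypothetical, and the sketch assumes that feature groups have already been detected in each image.

```python
from collections import Counter
from math import log


def association_scores(images, group_id):
    """Score caption words by how strongly they co-occur with a feature group.

    `images` is a list of (detected_groups, caption_words) pairs. Both the
    group identifiers and this scoring rule are illustrative assumptions,
    not the measure used in the paper.
    Returns {word: PMI between the word and the group}.
    """
    n = len(images)
    # Number of images in which the group was detected.
    n_group = sum(1 for groups, _ in images if group_id in groups)
    # Number of images whose caption contains each word.
    word_counts = Counter(w for _, words in images for w in set(words))
    # Number of images containing both the group and the word.
    joint_counts = Counter(
        w for groups, words in images if group_id in groups for w in set(words)
    )
    scores = {}
    for word, n_joint in joint_counts.items():
        # PMI = log( P(group, word) / (P(group) * P(word)) )
        scores[word] = log((n_joint * n) / (n_group * word_counts[word]))
    return scores


# Toy training set: (groups detected in the image, words in its caption).
images = [
    ({"g1"}, {"mug", "table"}),
    ({"g1", "g2"}, {"mug", "book"}),
    ({"g2"}, {"book", "table"}),
    ({"g2"}, {"book"}),
]
scores = association_scores(images, "g1")
best_word = max(scores, key=scores.get)
```

In this toy data, the group "g1" appears only in images captioned with "mug", so "mug" receives the highest score. A grouping procedure in the spirit of the abstract could use such a score to decide whether adding or removing features from a collection strengthens its link to a word.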