Discovering hierarchical object models from captioned images

  • Authors:
  • Michael Jamieson; Yulia Eskin; Afsaneh Fazly; Suzanne Stevenson; Sven J. Dickinson

  • Affiliations:
  • Department of Computer Science, University of Toronto, 6 King's College Rd., Toronto, Ontario, Canada M5S 3G4 (all authors)

  • Venue:
  • Computer Vision and Image Understanding
  • Year:
  • 2012


Abstract

We address the problem of automatically learning the recurring associations between the visual structures in images and the words in their associated captions, yielding a set of named object models that can be used for subsequent image annotation. In previous work, we used language to drive the perceptual grouping of local features into configurations that capture small parts (patches) of an object. However, model scope was limited, leading to poor object localization during detection (annotation), and ambiguity was high when part detections were weak. We extend and significantly revise our previous framework by using language to drive the perceptual grouping of parts (each a configuration in the previous framework) into hierarchical configurations that offer greater spatial extent and flexibility. The resulting hierarchical multipart models remain scale-, translation-, and rotation-invariant, but are more reliable detectors and provide better localization. Moreover, unlike typical frameworks for learning object models, our approach requires no bounding boxes around the objects to be learned, can handle heavily cluttered training scenes, and is robust to noisy captions, i.e., captions in which objects appearing in the image may go unnamed and named objects may not appear in the image. We demonstrate improved precision and recall in annotation over the non-hierarchical technique, as well as extended spatial coverage of detected objects.
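
To give the flavor of the word-structure association the abstract describes, the sketch below is a minimal, hypothetical illustration, not the authors' algorithm: it assumes each training image yields a bag of caption words and a bag of detected visual part identifiers, and ranks word-part pairs by pointwise mutual information as a simple stand-in for a learned association measure. All names in it (`dataset`, `part_3`, etc.) are invented for illustration.

```python
import math
from collections import Counter

def score_word_part_pairs(dataset):
    """Score word/part associations by pointwise mutual information (PMI).

    `dataset` is a hypothetical list of (caption_words, part_ids) pairs:
    tokens from an image's caption alongside identifiers of the visual
    part configurations detected in that image.
    """
    n = len(dataset)
    word_count, part_count, pair_count = Counter(), Counter(), Counter()
    for words, parts in dataset:
        words, parts = set(words), set(parts)
        word_count.update(words)
        part_count.update(parts)
        pair_count.update((w, p) for w in words for p in parts)
    # PMI = log P(w, p) / (P(w) * P(p)); positive values indicate the
    # word and part co-occur more often than chance would predict.
    return {
        (w, p): math.log((c / n) / ((word_count[w] / n) * (part_count[p] / n)))
        for (w, p), c in pair_count.items()
    }

# Toy usage: pair each word with its most strongly associated part.
dataset = [
    (["airplane", "sky"], ["part_3", "part_7"]),
    (["airplane", "runway"], ["part_3"]),
    (["boat", "water"], ["part_5"]),
]
best = {}
for (w, p), s in score_word_part_pairs(dataset).items():
    if w not in best or s > best[w][1]:
        best[w] = (p, s)
print(best)  # e.g. 'airplane' maps to 'part_3'
```

In the full framework, per the abstract, the visual structures paired with words are not single part identifiers but hierarchical configurations of parts, and the association between words and structures drives the perceptual grouping itself.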