Visual dictionary learning for joint object categorization and segmentation

Authors:
Aastha Jain;Luca Zappella;Patrick McClure;René Vidal
Affiliations:
Center for Imaging Science, Johns Hopkins University (all four authors)
Venue:
ECCV'12 Proceedings of the 12th European Conference on Computer Vision - Volume Part V
Year:
2012

Abstract

Representing objects using elements from a visual dictionary is widely used in object detection and categorization. Prior work on dictionary learning has shown improvements in the accuracy of object detection and categorization by learning discriminative dictionaries. However, none of these dictionaries are learnt for joint object categorization and segmentation. Moreover, dictionary learning is often done separately from classifier training, which reduces the discriminative power of the model. In this paper, we formulate the semantic segmentation problem as a joint categorization, segmentation and dictionary learning problem. To that end, we propose a latent conditional random field (CRF) model in which the observed variables are pixel category labels and the latent variables are visual word assignments. The CRF energy consists of a bottom-up segmentation cost, a top-down bag of (latent) words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, relationships between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learnt jointly using latent structural SVMs, and the segmentation and visual word assignments are inferred jointly using energy minimization techniques. Experiments on the Graz02 and CamVid datasets demonstrate the performance of our approach.
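
The three-part energy described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the Potts smoothness term, the linear score on per-category word histograms, the squared-distance dictionary cost, and the names `energy`, `brute_force_inference`, `lam_pair`, and `lam_dict` are all hypothetical. The spatial term among visual words is omitted, and exhaustive search merely stands in for the energy minimization techniques the abstract mentions.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def energy(labels, words, features, dictionary, unary, edges, weights,
           lam_pair=1.0, lam_dict=1.0):
    """Toy three-part energy; every term is an illustrative assumption."""
    n = len(labels)
    # Bottom-up segmentation cost: per-site unary plus a Potts smoothness term.
    bottom_up = sum(unary[i][labels[i]] for i in range(n))
    bottom_up += lam_pair * sum(labels[i] != labels[j] for i, j in edges)
    # Top-down categorization cost: linear score on per-category word histograms.
    num_cats, num_words = len(weights), len(dictionary)
    hist = [[0] * num_words for _ in range(num_cats)]
    for c, k in zip(labels, words):
        hist[c][k] += 1
    top_down = sum(weights[c][k] * hist[c][k]
                   for c in range(num_cats) for k in range(num_words))
    # Dictionary cost: distance from each feature to its assigned visual word.
    dict_cost = lam_dict * sum(sq_dist(features[i], dictionary[words[i]])
                               for i in range(n))
    return bottom_up + top_down + dict_cost


def brute_force_inference(features, dictionary, unary, edges, weights):
    """Jointly pick labels and word assignments by exhaustive search."""
    n, num_cats, num_words = len(features), len(weights), len(dictionary)
    best = None
    for labels in product(range(num_cats), repeat=n):
        for words in product(range(num_words), repeat=n):
            e = energy(labels, words, features, dictionary,
                       unary, edges, weights)
            if best is None or e < best[0]:
                best = (e, labels, words)
    return best


# Hypothetical toy problem: three sites in a chain, two categories, two words.
features = [(0.0,), (0.1,), (1.0,)]
dictionary = [(0.0,), (1.0,)]
unary = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]   # unary[i][c]
edges = [(0, 1), (1, 2)]
weights = [[0.0, 1.0], [1.0, 0.0]]             # weights[c][k], lower is better

best_energy, best_labels, best_words = brute_force_inference(
    features, dictionary, unary, edges, weights)
print(best_labels, best_words)
```

In this toy setting the search is exponential in the number of sites, so it only serves to show that labels and latent word assignments are chosen together under a single objective; it says nothing about how the parameters would be learnt.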