Compact bag-of-words visual representation for effective linear classification

Authors:
Xiaodan Zhuang;Shuang Wu;Pradeep Natarajan
Affiliations:
Raytheon BBN Technologies, Cambridge, MA, USA;Raytheon BBN Technologies, Cambridge, MA, USA;Raytheon BBN Technologies, Cambridge, MA, USA
Venue:
Proceedings of the 21st ACM international conference on Multimedia
Year:
2013

Abstract

Bag-of-words approaches have been shown to achieve state-of-the-art performance in large-scale multimedia event detection. However, the commonly used histogram representation of bag-of-words requires large codebooks and expensive nonlinear kernel-based classifiers for optimal performance. To address these two issues, we present a two-part generative model for compact visual representation, based on the i-vector approach recently proposed for speech and audio modeling. First, we use a Gaussian mixture model (GMM) to model the joint distribution of local descriptors. Second, we use a low-dimensional factor representation that constrains the GMM parameters to a subspace that preserves most of the information. We further extend this method to incorporate overlapping spatial regions, forming a highly compact visual representation that achieves superior performance with fast linear classifiers. We evaluate the method on a large video dataset used in the TRECVID 2011 MED evaluation. With linear classifiers, the proposed representation outperforms the soft quantization histograms used in the top-performing TRECVID 2011 MED systems while requiring one-tenth of the storage footprint.
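The two-part model described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a diagonal-covariance GMM that has already been trained, and it uses the standard i-vector point estimate from the speaker-verification literature, w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F, where N and F are the zeroth-order and centered first-order statistics of the local descriptors. The function names, the dimensions, and the random subspace matrix `T` are hypothetical; in practice `T` would be learned with EM, and neither that training step nor the spatial-region extension is shown.

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Responsibility of each diagonal-covariance Gaussian for each descriptor."""
    diff = X[:, None, :] - means[None, :, :]                      # (n, C, D)
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                         # (n, C)
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def extract_factor_vector(X, weights, means, variances, T):
    """Point estimate of the low-dimensional factor for one set of descriptors.

    T has shape (C * D, R) and maps the R-dimensional factor to an offset
    of the stacked GMM means (assumption: T is given, not trained here).
    """
    C, D = means.shape
    post = gmm_posteriors(X, weights, means, variances)           # (n, C)
    N = post.sum(axis=0)                                          # zeroth-order stats, (C,)
    F = post.T @ X - N[:, None] * means                           # centered first-order stats, (C, D)
    inv_var = (1.0 / variances).reshape(-1)                       # diagonal of Sigma^-1, (C*D,)
    N_rep = np.repeat(N, D)                                       # N expanded per dimension, (C*D,)
    Tt_Sinv = T.T * inv_var                                       # T^T Sigma^-1, (R, C*D)
    precision = np.eye(T.shape[1]) + (Tt_Sinv * N_rep) @ T        # I + T^T Sigma^-1 N T
    return np.linalg.solve(precision, Tt_Sinv @ F.reshape(-1))

# Toy usage with synthetic data; all sizes are illustrative only.
rng = np.random.default_rng(0)
C, D, R = 8, 16, 4                         # components, descriptor dim, factor dim
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
T = 0.1 * rng.normal(size=(C * D, R))      # hypothetical, untrained subspace
X = rng.normal(size=(200, D))              # stand-in for local descriptors of one video
w = extract_factor_vector(X, weights, means, variances, T)
```

The resulting vector `w` has only `R` entries per video (or per spatial region), which is what makes it practical to store and to feed to a fast linear classifier instead of a kernel machine over a large histogram.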