Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
LIBLINEAR: A Library for Large Linear Classification
The Journal of Machine Learning Research
SIFT-Bag kernel for video event analysis
MM '08 Proceedings of the 16th ACM international conference on Multimedia
IEEE Transactions on Pattern Analysis and Machine Intelligence
Improving the fisher kernel for large-scale image classification
ECCV'10 Proceedings of the 11th European conference on Computer vision: Part IV
Front-End Factor Analysis for Speaker Verification
IEEE Transactions on Audio, Speech, and Language Processing
Hi-index | 0.00 |
Bag-of-words approaches have been shown to achieve state-of-the-art performance in large-scale multimedia event detection. However, the commonly used histogram representation of bag-of-words requires large codebook sizes and expensive nonlinear kernel based classifiers for optimal performance. To address these two issues, we present a two-part generative model for compact visual representation, based on the i-vector approach recently proposed for speech and audio modeling. First, we use a Gaussian mixture model (GMM) to model the joint distribution of local descriptors. Second, we use a low-dimensional factor representation that constrains the GMM parameters to a subspace that preserves most of the information. We further extend this method to incorporate overlapping spatial regions, forming a highly compact visual representation that achieves superior performance with fast linear classifiers. We evaluate the method on a large video dataset used in the TRECVID 2011 MED evaluation. With linear classifiers, the proposed representation, with one-tenth of the storage footprint, outperforms soft quantization histograms used in the top performing TRECVID 2011 MED systems.