Marginal-based visual alphabets for local image descriptors aggregation

Authors:
Miriam Redi;Bernard Merialdo
Affiliations:
EURECOM, Sophia Antipolis, France;EURECOM, Sophia Antipolis, France
Venue:
MM '11 Proceedings of the 19th ACM international conference on Multimedia
Year:
2011

Citing 6
Cited 2

Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope

International Journal of Computer Vision
Distinctive Image Features from Scale-Invariant Keypoints

International Journal of Computer Vision
Creating Efficient Codebooks for Visual Recognition

ICCV '05 Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 - Volume 01
Scalable Recognition with a Vocabulary Tree

CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
Evaluation campaigns and TRECVid

MIR '06 Proceedings of the 8th ACM international workshop on Multimedia information retrieval
PCA-SIFT: a more distinctive representation for local image descriptors

CVPR'04 Proceedings of the 2004 IEEE computer society conference on Computer vision and pattern recognition

Exploring two spaces with one feature: kernelized multidimensional modeling of visual alphabets

Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Direct modeling of image keypoints distribution through copula-based image signatures

Proceedings of the 3rd ACM conference on International conference on multimedia retrieval

Quantified Score

Hi-index	0.01

Visualization

Abstract

Bag of Words (BOW) models are nowadays one of the most effective methods for visual categorization. They use visual dictionaries to aggregate the set of local descriptors extracted from a given image. Despite their high discriminative ability, one of the major drawbacks of BOW still remains the computational cost of the visual dictionary, built by clustering in the high dimensional feature space. In this paper we introduce a fast, effective method for local image descriptors aggregation that is based on marginal approximations, i.e. the approximation of each descriptor component distribution. We quantize each dimension of the feature space, obtaining a visual alphabet that we use to map the image descriptors in a fixed-length visual signature. Experimental results show that our new method outperforms the traditional BOW model in both accuracy and efficiency for the scene recognition task. Moreover, we discover that the marginal-based aggregation provides complementary information with respect to BOW, by combining the two models in a video retrieval system based on TRECVID 2010.