Image Classification with the Fisher Vector: Theory and Practice

  • Authors:
  • Jorge Sánchez;Florent Perronnin;Thomas Mensink;Jakob Verbeek

  • Affiliations:
  • CIEM-CONICET, FaMAF, Universidad Nacional de Córdoba, Córdoba, Argentina X5000HUA;Xerox Research Centre Europe, Meylan, France 38240;Intelligent Systems Lab Amsterdam, University of Amsterdam, Amsterdam, The Netherlands 1098 XH;LEAR Team, INRIA Grenoble, Montbonnot, France 38330

  • Venue:
  • International Journal of Computer Vision
  • Year:
  • 2013

Abstract

A standard approach to describing an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high-dimensional vector, and pool them into an image-level signature. The most common patch encoding strategy consists of quantizing the local descriptors into a finite set of prototypical elements, which leads to the popular bag-of-visual-words representation. In this work, we propose to use the Fisher kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a "universal" generative Gaussian mixture model. This representation, which we call the Fisher vector (FV), has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets (PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K), with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.
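
To make the idea of describing patches by their deviation from a Gaussian mixture model concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a diagonal-covariance GMM whose parameters are already given, and it uses a commonly cited form of the gradients with respect to the means and standard deviations, averaged over the patches. The function name `fisher_vector`, its argument names, and the toy data are hypothetical and chosen only for illustration. Steps that the abstract does not describe, such as GMM training, any normalization of the signature, and product quantization, are left out.

```python
import numpy as np


def fisher_vector(X, weights, means, variances):
    """Encode local descriptors X as one image-level vector.

    X:         (T, D) array of T local patch descriptors.
    weights:   (K,)   mixture weights of the GMM (assumed to sum to 1).
    means:     (K, D) component means.
    variances: (K, D) diagonal covariances (assumed strictly positive).
    Returns a vector of length 2 * K * D.
    """
    T, _ = X.shape
    diff = X[:, None, :] - means[None, :, :]               # (T, K, D)

    # Log-density of every descriptor under every diagonal Gaussian.
    log_prob = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                       # (T, K)

    # Soft assignments (posteriors), computed stably in the log domain.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)               # (T, K)

    # Deviations of the descriptors from each component, in units of sigma.
    z = diff / np.sqrt(variances)                           # (T, K, D)

    # Average gradient statistics with respect to means and std. deviations.
    g_mu = np.einsum("tk,tkd->kd", gamma, z)
    g_mu /= T * np.sqrt(weights)[:, None]
    g_sigma = np.einsum("tk,tkd->kd", gamma, z ** 2 - 1.0)
    g_sigma /= T * np.sqrt(2.0 * weights)[:, None]

    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])


# Toy usage with made-up data: K = 3 components, D = 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
weights = np.array([0.5, 0.3, 0.2])
means = rng.normal(size=(3, 4))
variances = np.ones((3, 4))
fv = fisher_vector(X, weights, means, variances)            # length 2*K*D = 24
```

Under these assumptions the signature has a fixed length of 2KD regardless of how many patches the image contains, which is what makes it usable with a linear classifier. A single-component model fitted exactly to the descriptors yields an all-zero vector, which illustrates that the encoding measures deviation from the model.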