Efficient visual content retrieval and mining in videos

  • Authors: Josef Sivic, Andrew Zisserman

  • Affiliation: Robotics Research Group, Department of Engineering Science, University of Oxford (both authors)

  • Venue: PCM'04, Proceedings of the 5th Pacific Rim Conference on Advances in Multimedia Information Processing - Volume Part II
  • Year: 2004

Abstract

We describe an image representation for objects and scenes consisting of a configuration of viewpoint covariant regions and their descriptors. This representation enables recognition to proceed successfully despite changes in scale, viewpoint, illumination and partial occlusion. Vector quantization of these descriptors then enables efficient matching on the scale of an entire feature film. We show two applications. The first is efficient object retrieval, where the technology of text retrieval, such as inverted file systems, can be employed at run time to return all shots containing the object in a manner, and with a speed, similar to a Google search for text. The object is specified by a user outlining it in an image, and the object is then delineated in the retrieved shots. The second application is data mining. We obtain the principal objects, characters and scenes in a video by measuring the recurrence of these spatial configurations of viewpoint covariant regions. The applications are illustrated on two full-length feature films.
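To make the retrieval idea in the abstract concrete, the sketch below illustrates the general "visual words" pattern it describes: local descriptors are vector-quantized against a fixed vocabulary, each frame becomes a tf-idf weighted bag of visual words, and an inverted index maps each word to the frames containing it so a query only touches a few posting lists. This is not the authors' code; the vocabulary, the toy data, and all function names are illustrative assumptions, and the paper's viewpoint covariant region detection and spatial-consistency re-ranking are omitted.

```python
# Minimal sketch of bag-of-visual-words retrieval with an inverted index.
# All data here is synthetic; a real system would quantize SIFT-like
# descriptors extracted from viewpoint covariant regions.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Hypothetical vocabulary: K cluster centres learned offline (e.g. by k-means).
K, D = 50, 128
vocabulary = rng.normal(size=(K, D))

def quantize(descriptors: np.ndarray) -> np.ndarray:
    """Assign each descriptor to its nearest visual word (vocabulary index)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Build an inverted index over a toy "film": word id -> frames containing it.
inverted_index = defaultdict(set)   # word -> {frame ids}
frame_words = {}                    # frame -> array of word ids
num_frames = 200
for frame_id in range(num_frames):
    descriptors = rng.normal(size=(int(rng.integers(20, 60)), D))
    words = quantize(descriptors)
    frame_words[frame_id] = words
    for w in set(words.tolist()):
        inverted_index[w].add(frame_id)

# Inverse document frequency: rare visual words are more discriminative.
idf = {w: np.log(num_frames / len(frames)) for w, frames in inverted_index.items()}

def search(query_descriptors: np.ndarray, top_n: int = 5):
    """Score only frames sharing at least one visual word with the query."""
    query_words = quantize(query_descriptors)
    scores = defaultdict(float)
    for w in set(query_words.tolist()):
        for frame_id in inverted_index[w]:
            tf = np.mean(frame_words[frame_id] == w)  # term frequency in frame
            scores[frame_id] += tf * idf[w]
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

# Query with descriptors from a user-outlined region (synthetic here).
print(search(rng.normal(size=(30, D))))
```

Because scoring visits only the posting lists of the query's visual words rather than every frame, retrieval cost grows with the number of matching words, not with the length of the film, which is what makes the Google-like response time plausible at feature-film scale.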