Actor-independent action search using spatiotemporal vocabulary with appearance hashing

Authors:
Rongrong Ji;Hongxun Yao;Xiaoshuai Sun
Affiliations:
Visual Intelligence Laboratory, Department of Computer Science and Engineering, Harbin Institute of Technology, P.O. Box 321, 150001 Harbin, Heilongjiang Province, P.R. China (all authors)
Venue:
Pattern Recognition
Year:
2011

Abstract

Human actions in movies and sitcoms usually carry semantic cues for story understanding, which offers a novel search pattern beyond the traditional video search scenario. However, action-level video search faces great challenges, such as global motions, concurrent actions, and variations in actor appearance. In this paper, we introduce a generalized action retrieval framework that achieves fully unsupervised, robust, and actor-independent action search in large-scale databases. First, an Attention Shift model is presented to extract human-focused foreground actions from videos containing global motions or concurrent actions. Subsequently, a spatiotemporal vocabulary is built from 3D-SIFT features extracted from these human-focused action regions. These 3D-SIFT features offer robustness against rotation and viewpoint changes. The spatiotemporal vocabulary ensures search efficiency, which is achieved through an inverted indexing structure with approximate nearest-neighbor search. In the online ranking, we employ the dynamic time warping distance to handle variations in action duration as well as partial action matching. Finally, an appearance hashing strategy is presented to address the performance degradation caused by divergent actor appearances. For experimental validation, we deployed the actor-independent action retrieval framework on three seasons of the sitcom "Friends" (over 30 h). On this database, we report the best performance (MAP@10.53) in comparison with alternative and state-of-the-art approaches.
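
The online ranking step can be illustrated with a minimal sketch of the dynamic time warping (DTW) distance. This is not the authors' implementation: it assumes each action clip is represented as a sequence of per-segment feature vectors (for example, visual-word histograms), and the names `dtw_distance`, `euclidean`, and the `partial` flag are hypothetical. With `partial=True` the query may align to any contiguous part of the target, which is one common way to model partial matching.

```python
from math import inf, sqrt
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(query: List[Vector], target: List[Vector],
                 partial: bool = False) -> float:
    """Accumulated DTW cost between two feature sequences.

    Assumption: each element is one per-segment descriptor of a clip.
    partial=False aligns the whole query with the whole target.
    partial=True lets the query start and end anywhere in the target.
    """
    n, m = len(query), len(target)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j]: minimal accumulated cost of aligning query[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    if partial:
        # Free start: the alignment may begin at any target position.
        for j in range(1, m + 1):
            cost[0][j] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(query[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # target element repeated
                                 cost[i][j - 1],      # query element repeated
                                 cost[i - 1][j - 1])  # both sequences advance

    # Free end when partial: take the best end position in the target.
    return min(cost[n][1:]) if partial else cost[n][m]


# The same "action" performed at half speed still aligns with zero cost.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
print(dtw_distance(fast, slow))

# A short query found inside a longer clip.
print(dtw_distance([[1.0], [2.0]], [[0.0], [1.0], [2.0], [3.0]], partial=True))
```

The quadratic cost table is adequate for re-ranking a short candidate list, which fits the abstract's description of an inverted index producing candidates before DTW is applied. How the paper bounds or accelerates the alignment is not stated in the abstract.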