We present a new descriptor for the local representation of human actions. In contrast to state-of-the-art descriptors, which use spatio-temporal features to describe cuboids detected in video sequences, we propose a 2D descriptor based on the Laplacian pyramid for efficiently encoding spatio-temporal regions of interest. Image templates, including structural planes and motion templates, are first extracted from a cuboid to encode its structural and motion features. A 2D Laplacian pyramid is then applied to decompose each of these images into a series of sub-band feature maps, followed by a two-stage feature extraction: Gabor filtering and max pooling. The filtering enhances motion-related edge and orientation information. To capture more discriminative and invariant features, max pooling is applied to the Gabor filter outputs, both between scales within filter banks and over spatial neighborhoods. The resulting local features associated with cuboids are fed to localized soft-assignment coding with max pooling under the Bag-of-Words (BoW) model to represent an action. The image templates, i.e., MHI and TOP, explicitly encode the motion and structure information in the video sequences, and the proposed Laplacian pyramid coding descriptor provides an informative representation of them through multi-scale analysis. The use of localized soft-assignment coding and max pooling yields a robust representation of actions. Experimental results on the benchmark KTH dataset and the newly released and challenging HMDB51 dataset demonstrate the effectiveness of the proposed method for human action recognition.
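The descriptor pipeline described above (Laplacian pyramid decomposition of a template, a Gabor filter bank, then max pooling over space and across scales) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the kernel sizes, the number of pyramid levels and orientations, and the global-max pooling at the end are simplifying assumptions chosen for brevity.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def convolve2d(img, kernel):
    # Naive same-size 2D convolution with edge padding (fine for small templates).
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def laplacian_pyramid(img, levels=3):
    # Each band is the difference between an image and its blurred copy;
    # the blurred copy is downsampled to form the next octave.
    pyr, cur, g = [], img.astype(float), gaussian_kernel()
    for _ in range(levels):
        blurred = convolve2d(cur, g)
        pyr.append(cur - blurred)      # band-pass (Laplacian) layer
        cur = blurred[::2, ::2]        # downsample for the next level
    pyr.append(cur)                    # low-pass residual
    return pyr

def gabor_kernel(theta, size=7, sigma=2.0, lam=4.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def describe(template, levels=3, n_orient=4):
    """Laplacian pyramid -> Gabor filtering -> max pooling over space and scale."""
    thetas = [k * np.pi / n_orient for k in range(n_orient)]
    responses = []
    for band in laplacian_pyramid(template, levels):
        per_orient = []
        for th in thetas:
            r = np.abs(convolve2d(band, gabor_kernel(th)))
            # Spatial max pooling over 2x2 neighborhoods ...
            h, w = r.shape[0] // 2 * 2, r.shape[1] // 2 * 2
            pooled = r[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
            # ... reduced here to a single global max per orientation (toy choice).
            per_orient.append(pooled.max())
        responses.append(per_orient)
    # Max pooling across pyramid scales: one value per orientation.
    return np.max(np.array(responses), axis=0)

# Toy binary "motion template" standing in for an MHI extracted from a cuboid.
template = np.zeros((32, 32))
template[8:24, 12:20] = 1.0
feat = describe(template)
print(feat.shape)  # (4,)
```

In the full method, such per-cuboid feature vectors (without the crude global max) would then be quantized by localized soft-assignment coding and max-pooled over the video to form the BoW action representation.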