Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

Authors:
Q. V. Le;W. Y. Zou;S. Y. Yeung;A. Y. Ng
Affiliations:
Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA;Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA;Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA;Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA
Venue:
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Year:
2011

Citing 0
Cited 34

Action recognition via bio-inspired features: The richness of center-surround interaction

Computer Vision and Image Understanding
Supervised class-specific dictionary learning for sparse modeling in action recognition

Pattern Recognition
Sparse Modeling of Human Actions from Motion Imagery

International Journal of Computer Vision
State of the Art Report on Video-Based Graphics and Video Visualization

Computer Graphics Forum
Deep nonlinear metric learning with independent subspace analysis for face verification

Proceedings of the 20th ACM international conference on Multimedia
Multi-channel shape-flow kernel descriptors for robust video event detection and retrieval

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part II
Complex events detection using data-driven concepts

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III
Directional space-time oriented gradients for 3d visual pattern analysis

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III
A convolutional treelets binary feature approach to fast keypoint recognition

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part V
Trajectory-Based modeling of human actions with motion reference points

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part V
Motion interchange patterns for action recognition in unconstrained videos

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part VI
Space-variant descriptor sampling for action recognition based on saliency and eye movements

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part VII
Atomic action features: a new feature for action recognition

ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part I
Learning invariant feature hierarchies

ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part I
Recognizing complex events using large margin joint low-level event model

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part IV
Towards space-time semantics in two frames

ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part III
Action recognition using linear dynamic systems

Pattern Recognition
Fixed frame temporal pooling

AI'12 Proceedings of the 25th Australasian joint conference on Advances in Artificial Intelligence
Action segmentation in dance videos

PCM'12 Proceedings of the 13th Pacific-Rim conference on Advances in Multimedia Information Processing
Auto learning temporal atomic actions for activity classification

Pattern Recognition
Latent semantic learning with structured sparse representation for human action recognition

Pattern Recognition
A line based pose representation for human action recognition

Image Communication
Action recognition using canonical correlation kernels

ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part III
Folk dance recognition using a bag of words approach and ISA/STIP features

Proceedings of the 6th Balkan Conference in Informatics
Combining multiple sensors for event recognition of older people

Proceedings of the 1st ACM international workshop on Multimedia indexing and information retrieval for healthcare
Action recognition using invariant features under unexampled viewing conditions

Proceedings of the 21st ACM international conference on Multimedia
A feature construction method for general object recognition

Pattern Recognition
Combining modality specific deep neural networks for emotion recognition in video

Proceedings of the 15th ACM on International conference on multimodal interaction
A local descriptor based on Laplacian pyramid coding for action recognition

Pattern Recognition Letters
Multiple scale-specific representations for improved human action recognition

Pattern Recognition Letters
Deep feature learning using target priors with applications in ECoG signal decoding for BCI

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Matching mixtures of curves for human action recognition

Computer Vision and Image Understanding
Graph-based approach for human action recognition using spatio-temporal features

Journal of Visual Communication and Image Representation
Multimedia event detection with multimodal feature fusion and temporal concept localization

Machine Vision and Applications

Abstract

Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previous published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/~wzou/.
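The abstract does not spell out how an Independent Subspace Analysis unit is computed. As a rough illustration, ISA is commonly described as a two-layer network: the first layer squares linear filter responses, and the second layer takes the square root of their sum within each fixed subspace. The sketch below follows that description on a flattened video block. The function names, the toy block size, and the random filter matrix `W` are assumptions made for illustration only; in the method itself the filters are learned from unlabeled video, and this is not the authors' released code.

```python
import math
import random


def flatten_block(block):
    """Flatten a video block indexed as [frame][row][col] into one vector."""
    return [v for frame in block for row in frame for v in row]


def isa_activations(x, W, subspace_size):
    """Compute ISA-style pooled responses for one input vector.

    x             -- flattened input patch (list of floats)
    W             -- filter matrix as a list of rows, one row per filter
    subspace_size -- number of consecutive filters pooled into one unit
    """
    # First layer: linear filter responses, then squaring.
    squared = [sum(w * v for w, v in zip(row, x)) ** 2 for row in W]
    # Second layer: square root of the summed energy in each subspace.
    return [
        math.sqrt(sum(squared[i:i + subspace_size]))
        for i in range(0, len(squared), subspace_size)
    ]


# Toy example (assumed sizes): 3 frames of 4x4 pixels, 6 random filters
# pooled in pairs. Random filters stand in for learned ones here.
rng = random.Random(0)
block = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(4)] for _ in range(3)]
x = flatten_block(block)
W = [[rng.gauss(0, 1) for _ in range(len(x))] for _ in range(6)]
features = isa_activations(x, W, subspace_size=2)
```

Because each pooled unit depends only on the energy within its subspace, flipping the sign of the input leaves the features unchanged, which is one simple instance of the invariance the abstract refers to. Stacking and convolution, as mentioned in the abstract, would apply units like these to overlapping sub-blocks and feed their outputs to a second layer; that step is not shown here.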