Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing
This paper explores the effectiveness of sparse representations obtained by learning an overcomplete basis (dictionary) in the context of action recognition in videos. Although this work concentrates on recognizing human movements—physical actions as well as facial expressions—the proposed approach is fairly general and can be applied to other classification problems. To model human actions, three overcomplete dictionary learning frameworks are investigated. An overcomplete dictionary is constructed from a set of spatio-temporal descriptors (extracted from the video sequences) such that each descriptor is represented by a linear combination of a small number of dictionary elements. This leads to a more compact and richer representation of the video sequences than existing methods based on clustering and vector quantization. For each framework, a novel classification algorithm is proposed. Additionally, this work presents a new local spatio-temporal feature that is distinctive, scale invariant, and fast to compute. The proposed approach consistently achieves state-of-the-art results on several public data sets containing various physical actions and facial expressions.
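The core idea—representing each descriptor as a sparse linear combination of a few atoms from an overcomplete dictionary—can be sketched with a greedy sparse coder. This is not the paper's method; it is a minimal illustration using Orthogonal Matching Pursuit over a random unit-norm dictionary, with the descriptor dimension, dictionary size, and sparsity level chosen arbitrarily for the example.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: approximate x with at most k
    columns (atoms) of the dictionary D."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit of x on the selected atoms
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = x - D @ coef
    return coef

rng = np.random.default_rng(0)
d, n_atoms, k = 32, 128, 3              # overcomplete: 128 atoms in R^32
D = rng.normal(size=(d, n_atoms))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms

# synthetic "descriptor": a true 3-sparse combination of dictionary atoms
x = D[:, [5, 40, 99]] @ np.array([1.0, -0.5, 2.0])

alpha = omp(D, x, k)
print(np.count_nonzero(alpha))          # at most k non-zero coefficients
```

In the paper's setting, the dictionary itself is learned from the spatio-temporal descriptors (rather than drawn at random), and the resulting sparse coefficient vectors replace the quantized codewords used by clustering-based bag-of-features pipelines.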