Fast human action classification and VOI localization with enhanced sparse coding
Journal of Visual Communication and Image Representation
By extracting local spatial-temporal features from videos, many recently proposed approaches to action recognition achieve promising performance. The Bag-of-Words (BoW) model is commonly used in these approaches to obtain video-level representations. However, the BoW model coarsely assigns each feature vector to its single closest visual word, inevitably introducing nontrivial quantization errors that limit further improvements in classification rates. To obtain a more accurate and discriminative representation, in this paper we propose an approach to action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. In this way, each local spatial-temporal feature is represented as a linear combination of a few "atoms" from a trained dictionary. In addition, we investigate constructing the dictionary under the guidance of transfer learning: we collect a large, diverse set of video clips from sport games and movies, from which the universal atoms composing the dictionary are learned with an online learning strategy. We evaluate our approach on the KTH and UCF sports datasets. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
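The abstract's contrast between BoW hard assignment and sparse coding can be sketched in a few lines. This is only a toy illustration under stated assumptions: the 2-D "features", the hand-picked unit-norm dictionary, and the greedy matching-pursuit solver are all hypothetical stand-ins, not the paper's actual descriptors, learned dictionary, or sparse solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bow_assign(feature, dictionary):
    """BoW quantization: hard-assign the feature to its single closest
    visual word, discarding everything but the winning index."""
    dists = [sum((f - a) ** 2 for f, a in zip(feature, atom))
             for atom in dictionary]
    return dists.index(min(dists))

def sparse_code(feature, dictionary, n_atoms=2):
    """Sparse coding sketch: greedily pick the atom most correlated with
    the residual and subtract its contribution (matching pursuit).
    Atoms are assumed unit-norm, so the correlation is the coefficient.
    The result expresses the feature as a combination of a few atoms,
    keeping more information than a single hard assignment."""
    residual = list(feature)
    code = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        corrs = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(corrs[i]))
        coef = corrs[k]
        code[k] += coef
        residual = [r - coef * a for r, a in zip(residual, dictionary[k])]
    return code

# Hypothetical unit-norm dictionary of three 2-D atoms.
D = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [0.5, 0.5]
print(bow_assign(x, D))       # one winning word index
print(sparse_code(x, D))      # a few nonzero coefficients
```

BoW keeps only the index of the nearest word, whereas the sparse code retains the coefficients of several atoms, which is the source of the reduced quantization error the abstract refers to.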