Sparse coding on local spatial-temporal volumes for human action recognition

  • Authors:
  • Yan Zhu; Xu Zhao; Yun Fu; Yuncai Liu

  • Affiliations:
  • Shanghai Jiao Tong University, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China; Department of CSE, SUNY at Buffalo, NY; Shanghai Jiao Tong University, Shanghai, China

  • Venue:
  • ACCV'10: Proceedings of the 10th Asian Conference on Computer Vision - Volume Part II
  • Year:
  • 2010

Abstract

By extracting local spatial-temporal features from videos, many recently proposed approaches for action recognition achieve promising performance. The Bag-of-Words (BoW) model is commonly used in these approaches to obtain video-level representations. However, the BoW model coarsely assigns each feature vector to its closest visual word, which inevitably introduces nontrivial quantization errors and limits further improvement of classification rates. To obtain a more accurate and discriminative representation, in this paper we propose an approach for action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. In this framework, each local spatial-temporal feature is represented as a linear combination of a few "atoms" from a trained dictionary. We also investigate constructing the dictionary under the guidance of transfer learning: we collect a large, diverse set of video clips from sports games and movies, from which a set of universal atoms composing the dictionary is learned via an online learning strategy. We test our approach on the KTH dataset and the UCF Sports dataset. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF Sports dataset.
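
To illustrate the pipeline the abstract describes (sparse coding of local spatio-temporal descriptors with an online-learned dictionary, followed by pooling into a video-level representation and classification), here is a minimal sketch in Python. It is not the authors' implementation: the descriptors are random stand-ins for the 3D gradient features, and the dictionary size, sparsity level, max-pooling step, and linear SVM are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

# Stand-in for local 3D spatial-temporal gradient descriptors; in the paper these
# would be extracted around interest points in each video clip. Random data is
# used here purely to show the encoding steps.
rng = np.random.default_rng(0)
n_videos, descriptors_per_video, descriptor_dim = 20, 100, 72
videos = [rng.standard_normal((descriptors_per_video, descriptor_dim))
          for _ in range(n_videos)]
labels = np.arange(n_videos) % 2  # hypothetical two-class action labels

# Learn a dictionary of atoms from all training descriptors with an online
# (mini-batch) strategy; n_components and alpha are assumed values.
all_descriptors = np.vstack(videos)
dict_learner = MiniBatchDictionaryLearning(
    n_components=128,               # dictionary size (assumption)
    alpha=1.0,                      # sparsity penalty (assumption)
    batch_size=256,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=5,    # each descriptor uses only a few atoms
    random_state=0,
)
dict_learner.fit(all_descriptors)

def video_representation(descriptors):
    """Sparse-code each descriptor, then max-pool absolute codes over the video."""
    codes = dict_learner.transform(descriptors)  # (n_descriptors, n_components)
    return np.abs(codes).max(axis=0)

X = np.array([video_representation(v) for v in videos])

# A linear SVM on the pooled sparse codes serves as the action classifier.
clf = LinearSVC().fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

In this sketch the max-pooled sparse codes replace the hard-assignment BoW histogram, which is the quantization-error issue the abstract points to; any other pooling (e.g., average pooling) could be substituted at that step.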