A comparative study of encoding, pooling and normalization methods for action recognition

Authors:
Xingxing Wang; Limin Wang; Yu Qiao
Affiliations:
Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China and Department of Information Engineering, The Chinese University of Hong Kong, China; Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China and Department of Information Engineering, The Chinese University of Hong Kong, China
Venue:
ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part III
Year:
2012

References:

Exploiting generative models in discriminative classifiers
Proceedings of the 1998 conference on Advances in neural information processing systems II

Recognizing Human Actions: A Local SVM Approach
ICPR '04 Proceedings of the Pattern Recognition, 17th International Conference on (ICPR'04) Volume 3 - Volume 03

On Space-Time Interest Points
International Journal of Computer Vision

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2

Pattern Recognition and Machine Learning (Information Science and Statistics)

Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study
International Journal of Computer Vision

Behavior recognition via sparse spatio-temporal features
ICCCN '05 Proceedings of the 14th International Conference on Computer Communications and Networks

Kernel Codebooks for Scene Categorization
ECCV '08 Proceedings of the 10th European Conference on Computer Vision: Part III

The Pascal Visual Object Classes (VOC) Challenge
International Journal of Computer Vision

Modeling temporal structure of decomposable motion segments for activity classification
ECCV'10 Proceedings of the 11th European conference on Computer vision: Part II

Improving the fisher kernel for large-scale image classification
ECCV'10 Proceedings of the 11th European conference on Computer vision: Part IV

Human activity analysis: A review
ACM Computing Surveys (CSUR)

LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)

Recognizing human actions by attributes
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition

Machine Recognition of Human Activities: A Survey
IEEE Transactions on Circuits and Systems for Video Technology

Action bank: A high-level representation of activity in video
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Learning latent temporal structure for complex event detection
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

HMDB: A large video database for human motion recognition
ICCV '11 Proceedings of the 2011 International Conference on Computer Vision

In defense of soft-assignment coding
ICCV '11 Proceedings of the 2011 International Conference on Computer Vision

Learning spatiotemporal graphs of human activities
ICCV '11 Proceedings of the 2011 International Conference on Computer Vision

Abstract

Bag of visual words (BoVW) models have been widely and successfully used in video-based action recognition. One key step in constructing a BoVW representation is to encode features with a codebook. Recently, a number of new encoding methods have been developed to improve the performance of BoVW-based object recognition and scene classification, such as soft-assignment encoding [1], sparse encoding [2], locality-constrained linear encoding [3], and Fisher kernel encoding [4]. However, their effect on action recognition is still unknown. The main objective of this paper is to evaluate and compare these new encoding methods in the context of video-based action recognition. We also analyze and evaluate the combination of encoding methods with different pooling and normalization strategies. We carry out experiments on the KTH dataset [5] and the HMDB51 dataset [6]. The results show that the new encoding methods can significantly improve recognition accuracy compared with classical vector quantization (VQ). Among them, Fisher kernel encoding and sparse encoding perform best. By properly choosing pooling and normalization methods, we achieve state-of-the-art performance on HMDB51.
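
To make the three stages named in the abstract concrete, the following is a minimal sketch of a BoVW pipeline: encode each local descriptor against a codebook, pool the codes over a video, then normalize the pooled vector. It is an illustration only, not the authors' implementation. All function names, the toy codebook, the descriptors, and the parameters `beta` and `alpha` are assumptions made for this example. Only classical VQ and a simple soft-assignment variant are shown; sparse, locality-constrained linear, and Fisher kernel encoding are omitted. The "power" option here is assumed to mean a signed power followed by L2 normalization.

```python
import math


def sq_dist(x, c):
    """Squared Euclidean distance between a descriptor and a codeword."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def vq_encode(x, codebook):
    """Classical VQ (hard assignment): one-hot code on the nearest codeword."""
    dists = [sq_dist(x, c) for c in codebook]
    nearest = dists.index(min(dists))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]


def soft_encode(x, codebook, beta=1.0):
    """Soft assignment: weights proportional to exp(-beta * squared distance).

    beta is a hypothetical smoothing parameter chosen for this sketch.
    """
    dists = [sq_dist(x, c) for c in codebook]
    d_min = min(dists)
    # Subtract the minimum distance before exponentiating for numerical stability.
    weights = [math.exp(-beta * (d - d_min)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def pool(codes, method="sum"):
    """Aggregate per-descriptor codes into one vector per video."""
    if method == "sum":
        return [sum(col) for col in zip(*codes)]
    if method == "max":
        return [max(col) for col in zip(*codes)]
    raise ValueError("unknown pooling method: " + method)


def normalize(v, method="l2", alpha=0.5):
    """L1, L2, or power (signed power, then L2) normalization."""
    if method == "power":
        v = [math.copysign(abs(a) ** alpha, a) for a in v]
        method = "l2"
    if method == "l1":
        norm = sum(abs(a) for a in v)
    elif method == "l2":
        norm = math.sqrt(sum(a * a for a in v))
    else:
        raise ValueError("unknown normalization method: " + method)
    return [a / norm for a in v] if norm > 0 else list(v)


# Toy data (assumed for illustration): 3 codewords, 3 local descriptors in 2-D.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.1], [0.2, 0.1]]

vq_codes = [vq_encode(x, codebook) for x in descriptors]
soft_codes = [soft_encode(x, codebook, beta=5.0) for x in descriptors]

hist_vq = pool(vq_codes, "sum")            # codeword counts
video_vq = normalize(hist_vq, "l1")        # L1-normalized histogram
video_soft = normalize(pool(soft_codes, "max"), "l2")
```

In this sketch the encoder, the pooling rule, and the normalization are independent choices, which is the kind of combination the study evaluates.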