We investigate how human action recognition can be improved by considering the spatio-temporal layout of actions. From the literature, we adopt a pipeline consisting of STIP features, a random forest to quantize the features into histograms, and an SVM classifier. Our goal is to detect 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. Our contribution is to improve the performance of this pipeline by exploiting a novel spatio-temporal layout of the 48 actions. Here, each STIP feature in the video contributes to the histogram bins not by a unit value, but by a weight given by its spatio-temporal probability. We propose 6 configurations of spatio-temporal layout, where the varied parameters are the coordinate system and the modeling of the action and its context. Our layout model does not change any other parameter of the pipeline: it requires no re-learning of the random forest, increases the size of the resulting representation by only a factor of two, and adds a minimal computational cost of only a handful of operations per feature. Extensive experiments demonstrate that the layout is distinctive of actions that involve trajectories, (dis)appearance, kinematics, and interactions. The visualization of each action's layout illustrates that our approach is indeed able to model the spatio-temporal patterns of each action. Each layout is experimentally shown to be optimal for a specific set of actions. Generally, the context has more effect than the choice of coordinate system. The most impressive improvements are achieved for complex actions involving items. For 43 out of 48 human actions, performance is better or equal when spatio-temporal layout is included. In addition, we show that our method outperforms the state of the art on the IXMAS and UT-Interaction datasets.
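The weighting scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the Gaussian model of the spatio-temporal layout, and all parameters are assumptions introduced here. Each STIP feature, already assigned to a histogram bin by the random forest, contributes a weight equal to its layout probability rather than a unit value; concatenating the weighted and unweighted histograms doubles the representation size, consistent with the factor-two increase stated above.

```python
import numpy as np

def layout_weighted_histograms(leaf_ids, coords, layout_mean, layout_cov, n_bins):
    """Hypothetical sketch of layout-weighted histogram construction.

    leaf_ids    : (N,) histogram bin index per STIP feature (from the forest)
    coords      : (N, 3) normalized (x, y, t) location of each feature
    layout_mean : (3,) mean of an assumed Gaussian spatio-temporal layout model
    layout_cov  : (3, 3) covariance of that assumed Gaussian model
    n_bins      : number of histogram bins
    """
    # Plain histogram: each feature adds a unit value to its bin.
    h_plain = np.bincount(leaf_ids, minlength=n_bins).astype(float)

    # Spatio-temporal weight: Gaussian density (up to a constant) at each
    # feature's location, standing in for the paper's layout probability.
    diff = coords - layout_mean
    inv_cov = np.linalg.inv(layout_cov)
    mahalanobis_sq = np.einsum('ni,ij,nj->n', diff, inv_cov, diff)
    weights = np.exp(-0.5 * mahalanobis_sq)

    # Weighted histogram: each feature contributes its layout weight instead of 1.
    h_layout = np.bincount(leaf_ids, weights=weights, minlength=n_bins)

    # Concatenation doubles the representation size (factor two, per the text).
    return np.concatenate([h_plain, h_layout])
```

The extra cost per feature is a handful of arithmetic operations (one Mahalanobis distance and an exponential), matching the minimal overhead claimed above.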