An evaluation of bags-of-words and spatio-temporal shapes for action recognition

Authors:
Teofilo de Campos;Mark Barnard;Krystian Mikolajczyk;Josef Kittler;Fei Yan;William Christmas;David Windridge
Affiliations:
CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK;CVSSP, University of Surrey, Guildford, GU2 7XH, UK
Venue:
WACV '11 Proceedings of the 2011 IEEE Workshop on Applications of Computer Vision (WACV)
Year:
2011

Citing 0
Cited 5

A new multi-agent system for video objects segmentation and tracking based on spatio-temporal descriptor
Proceedings of the International Conference on Advances in Computing, Communications and Informatics

Directional space-time oriented gradients for 3d visual pattern analysis
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III

Spatio-temporal video representation with locality-constrained linear coding
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III

A spatio-temporal pyramid matching for video retrieval
Computer Vision and Image Understanding

Video event detection for fault monitoring in assembly automation
International Journal of Intelligent Systems Technologies and Applications

Abstract

Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an unstructured global representation of videos built from a large set of local features. The latter (STS) uses a single feature located on a region of interest in the video (where the actor is). Despite the popularity of these methods, no comparison between them has been carried out. Moreover, because BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares the two approaches on four datasets with varying degrees of space-time specificity of the actions and varying relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. In addition to BoW and STS, we also evaluate novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification.
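
The distinction between a global BoW and a BoW constrained in time or space can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy random data, the nearest-codeword assignment, and the box-shaped region of interest are all assumptions made for illustration, and the abstract does not specify the local features, codebook, or constraint actually used. The STS representation itself (a single feature on the region of interest) is not modelled here.

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Map each local descriptor to the index of its nearest codeword."""
    # Squared Euclidean distances, shape (num_descriptors, vocab_size).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(words, vocab_size, mask=None):
    """L1-normalised histogram of visual words, optionally restricted by mask."""
    if mask is not None:
        words = words[mask]
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    total = hist.sum()
    # An empty selection yields an all-zero histogram instead of dividing by zero.
    return hist / total if total > 0 else hist

def inside_roi(positions, roi):
    """Boolean mask of features whose (x, y, t) lies in a hypothetical box ROI."""
    x0, x1, y0, y1, t0, t1 = roi
    x, y, t = positions.T
    return (x >= x0) & (x < x1) & (y >= y0) & (y < y1) & (t >= t0) & (t < t1)

# Toy data (assumed): 200 local features with 16-D descriptors and (x, y, t)
# positions, plus an 8-word codebook. Real systems learn the codebook from data.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 16))
positions = rng.uniform(0, 100, size=(200, 3))
codebook = rng.normal(size=(8, 16))

words = assign_words(descriptors, codebook)

# Global BoW: every local feature in the video contributes, background included.
global_hist = bow_histogram(words, vocab_size=8)

# Constrained BoW: only features inside the space-time box contribute.
roi = (20, 80, 20, 80, 0, 50)
roi_hist = bow_histogram(words, vocab_size=8, mask=inside_roi(positions, roi))
```

Both histograms have the same length, so either could be passed to the same classifier. The only difference in this sketch is which local features are allowed to contribute, and therefore how much background context is included.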