Learning latent spatio-temporal compositional model for human action recognition

Authors:
Xiaodan Liang;Liang Lin;Liangliang Cao
Affiliations:
Sun Yat-Sen University, Guangzhou, China;Sun Yat-Sen University, Guangzhou, China;IBM Research, New York, USA
Venue:
Proceedings of the 21st ACM international conference on Multimedia
Year:
2013

Abstract

Action recognition is an important problem in multimedia understanding. This paper addresses it by building an expressive compositional action model. We model one action instance in a video as an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed into a layout of deformable parts. In this way, our model identifies a Spatio-Temporal And-Or Graph (STAOG) that represents the latent structure of actions, e.g., triple jumping, swinging, and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom for detecting various action parts within video patches; (ii) the or-nodes above the leaf-nodes, i.e., switch variables that activate their children leaf-nodes to account for structural variability; (iii) the and-nodes within an anchor frame for verifying spatial composition; and (iv) the root-node at the top for aggregating scores over temporal anchor frames. Moreover, contextual interactions are defined between leaf-nodes in both the spatial and temporal domains. For model training, we develop a novel weakly supervised learning algorithm that iteratively determines the structural configuration (e.g., the production of leaf-nodes associated with the or-nodes) along with the optimization of the multi-layer parameters. By fully exploiting spatio-temporal compositions and interactions, our approach copes well with large intra-class action variance (e.g., different views, individual appearances, and spatio-temporal structures). Experimental results on challenging databases demonstrate superior performance of our approach over other methods.
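
The four-layer structure described in the abstract can be illustrated with a minimal Python sketch of bottom-up scoring. This is not the authors' implementation: the abstract does not give the scoring functions, so the sketch assumes that an or-node selects its highest-scoring child leaf-node, an and-node sums the or-node scores within one anchor frame, and the root-node sums over anchor frames. The contextual interaction terms and part deformation are omitted, and all function names and the toy scores are hypothetical.

```python
from typing import List, Tuple

# leaf_scores[t][p][k]: response of the k-th candidate leaf-node for part p
# in anchor frame t. The values used below are made up for illustration.
Frame = List[List[float]]
Video = List[Frame]


def or_node(children: List[float]) -> Tuple[int, float]:
    """Switch variable (assumed max): activate the best-scoring child leaf-node."""
    best = max(range(len(children)), key=lambda k: children[k])
    return best, children[best]


def and_node(frame: Frame) -> Tuple[List[int], float]:
    """Spatial composition (assumed sum) over the parts of one anchor frame."""
    picks: List[int] = []
    total = 0.0
    for part_children in frame:
        k, s = or_node(part_children)
        picks.append(k)
        total += s
    return picks, total


def root_node(video: Video) -> Tuple[List[List[int]], float]:
    """Aggregate (assumed sum) the and-node scores over all anchor frames."""
    config: List[List[int]] = []
    total = 0.0
    for frame in video:
        picks, s = and_node(frame)
        config.append(picks)
        total += s
    return config, total


# Toy input: 2 anchor frames, 2 parts per frame, 2 candidate leaf-nodes per part.
toy_scores: Video = [
    [[0.2, 0.9], [0.5, 0.1]],
    [[0.7, 0.3], [0.4, 0.6]],
]
config, score = root_node(toy_scores)
print(config)  # the leaf-node activated at each or-node, per anchor frame
```

The returned configuration records which leaf-node each or-node activates. According to the abstract, this kind of structural configuration is treated as latent and is determined iteratively during weakly supervised training, together with the model parameters.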