We study the problem of automatically learning an event AND-OR grammar from videos of a given environment, e.g., an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length (MDL) principles in a coherent probabilistic framework, without manual supervision about which events happen or when they happen. First, a predefined set of unary and binary relations is detected in each video frame, e.g., the agent's position, pose, and interaction with the environment. Their co-occurrences are then clustered into a dictionary of simple, transient atomic actions. These actions are recursively grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling the time constraints between successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style office video and present a prototype system for video analysis that integrates bottom-up detection, grammar learning, and parsing. On this dataset, the learning algorithm automatically discovers important events and constructs a stochastic grammar that can accurately parse newly observed video. The learned grammar can serve as a prior to improve the noisy bottom-up detection of atomic actions, and it can also be used to infer the semantics of the scene. In general, the event grammar is an efficient vehicle for acquiring common knowledge from video.
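As a toy illustration of the recursive, MDL-driven grouping step, the sketch below greedily merges the most frequent pair of adjacent atomic-action symbols into a new composite event symbol, accepting each merge only while a simple two-part description length (data bits plus grammar bits) decreases. The symbol names, the particular DL formula, and the greedy bigram strategy are illustrative assumptions of this sketch, not the paper's actual algorithm, which learns AND-OR rules under information projection.

```python
from collections import Counter
import math

def description_length(seqs, rules):
    """Two-part MDL: bits to encode the data given the grammar,
    plus bits to encode the grammar rules themselves."""
    symbols = [s for seq in seqs for s in seq]
    counts = Counter(symbols)
    n = len(symbols)
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    alphabet = max(len(counts) + len(rules), 2)
    rule_bits = sum(len(body) + 1 for body in rules.values()) * math.log2(alphabet)
    return data_bits + rule_bits

def induce_events(seqs, max_rules=10):
    """Greedily replace the most frequent adjacent symbol pair with a
    new composite event symbol while the description length decreases."""
    rules = {}
    for i in range(max_rules):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # a pair seen once cannot pay for its rule
        new_sym = f"E{i}"  # hypothetical composite-event name
        merged = []
        for seq in seqs:
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == a and seq[j + 1] == b:
                    out.append(new_sym)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            merged.append(out)
        cand_rules = dict(rules)
        cand_rules[new_sym] = (a, b)
        if description_length(merged, cand_rules) >= description_length(seqs, rules):
            break  # merge does not shorten the description; stop
        seqs, rules = merged, cand_rules
    return seqs, rules
```

Run on toy action sequences such as `[["sit", "type", "stand"], ["sit", "type", "leave"], ...]`, the frequent pair `("sit", "type")` is compressed into a composite event, while rarer pairs are rejected because the added rule costs more bits than it saves.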