Unsupervised learning of event AND-OR grammar and semantics from video

  • Authors:
  • Zhangzhang Si; Mingtao Pei; Benjamin Yao; Song-Chun Zhu

  • Affiliations:
  • Department of Statistics, University of California, Los Angeles, USA; Lab of Intelligent Info. Tech., Beijing Institute of Technology, China; Department of Statistics, University of California, Los Angeles, USA; Department of Statistics, University of California, Los Angeles, USA

  • Venue:
  • ICCV '11 Proceedings of the 2011 International Conference on Computer Vision
  • Year:
  • 2011

Abstract

We study the problem of automatically learning an event AND-OR grammar from videos of a given environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen or when they happen. First, a predefined set of unary and binary relations is detected in each video frame, e.g. the agent's position, pose, and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple, transient atomic actions. These actions are recursively grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling the time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video in an office, and present a prototype system for video analysis that integrates bottom-up detection, grammar learning, and parsing. On this dataset, the learning algorithm automatically discovers important events and constructs a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can serve as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer the semantics of the scene. In general, the event grammar is an efficient way to acquire common knowledge from video.
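
The recursive grouping step can be illustrated with a toy sketch. It is not the authors' algorithm: in place of the information projection and minimum description length criteria, it greedily merges the most frequent adjacent pair of symbols, and it learns only composition (AND) rules, with no OR nodes, probabilities, or time constraints. The action labels, the function names, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair and its count."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    if not counts:
        return None, 0
    # Ties are broken by the pair itself so the result is deterministic.
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))


def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_rules(sequences, min_count=2):
    """Greedily compose frequent adjacent pairs into new event symbols."""
    rules = {}
    seqs = [list(s) for s in sequences]
    while True:
        pair, count = most_frequent_pair(seqs)
        if pair is None or count < min_count:
            break
        name = f"E{len(rules) + 1}"  # hypothetical naming scheme
        rules[name] = pair
        seqs = [merge_pair(s, pair, name) for s in seqs]
    return rules, seqs


def expand(symbol, rules):
    """Expand a learned event symbol back into its atomic actions."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


# Hypothetical atomic-action sequences, one per video.
videos = [
    ["enter", "sit", "type", "stand", "leave"],
    ["enter", "sit", "type", "type", "stand", "leave"],
    ["enter", "get_water", "sit", "type", "stand", "leave"],
]

rules, parsed = learn_rules(videos)
for name, (left, right) in rules.items():
    print(f"{name} -> {left} {right}")
for seq in parsed:
    print(seq)
```

Each learned symbol stands for a longer event built from shorter ones, so the rewritten sequences are shorter than the inputs. A description-length criterion would weigh this saving against the cost of storing each new rule, which the fixed `min_count` threshold here only crudely approximates.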