Learning and parsing video events with goal and intent prediction

  • Authors:
  • Mingtao Pei;Zhangzhang Si;Benjamin Z Yao;Song-Chun Zhu

  • Affiliations:
  • Beijing Lab of Intelligent Information Technology, Beijing Institute of Technology, China and Department of Statistics, University of California, Los Angeles, United States;Department of Statistics, University of California, Los Angeles, United States;Department of Statistics, University of California, Los Angeles, United States;Department of Statistics, University of California, Los Angeles, United States

  • Venue:
  • Computer Vision and Image Understanding
  • Year:
  • 2013

Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for unsupervised learning of the T-AOG from video. The T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, which are specified by a number of grounded relations over image frames. An And-node represents a sequence of actions; an Or-node represents a number of alternative ways of such concatenations. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its child nodes to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm that learns the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG that can understand events, infer the goals of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from video data without manual supervision. (iii) Our algorithm infers the goals of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, including indoor and outdoor scenes with single- and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
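
The hierarchical And-Or structure described in the abstract can be made concrete with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: all node names, event labels, and branching probabilities are invented for the example. It shows how terminal atomic actions, And-nodes (temporal concatenation of sub-events), and Or-nodes (alternatives with branching probabilities) generate a valid temporal configuration of atomic actions together with its probability, i.e. one sample from the language of the equivalent SCFG.

# Minimal sketch of a stochastic Temporal And-Or Graph (T-AOG), for
# illustration only. Terminal nodes are atomic actions, And-nodes
# concatenate their children in temporal order, and Or-nodes select
# one alternative branch according to a branching probability.
import random
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Terminal:
    """Atomic action, grounded on spatial relations over image frames."""
    name: str


@dataclass
class AndNode:
    """Temporal concatenation of child sub-events."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class OrNode:
    """Alternative sub-events with branching probabilities (summing to 1)."""
    name: str
    branches: List[Tuple["Node", float]] = field(default_factory=list)


Node = Union[Terminal, AndNode, OrNode]


def sample(node: Node) -> Tuple[List[str], float]:
    """Sample one valid temporal configuration of atomic actions
    and return it together with its probability under the grammar."""
    if isinstance(node, Terminal):
        return [node.name], 1.0
    if isinstance(node, AndNode):
        seq, prob = [], 1.0
        for child in node.children:
            child_seq, child_prob = sample(child)
            seq.extend(child_seq)
            prob *= child_prob
        return seq, prob
    # OrNode: choose one branch with its branching probability.
    nodes = [branch for branch, _ in node.branches]
    weights = [weight for _, weight in node.branches]
    idx = random.choices(range(len(nodes)), weights=weights)[0]
    seq, prob = sample(nodes[idx])
    return seq, prob * weights[idx]


# Toy event (hypothetical names): "get water" = approach the dispenser,
# then either press the button or turn the tap, then drink.
get_water = AndNode("get_water", [
    Terminal("approach_dispenser"),
    OrNode("dispense", [(Terminal("press_button"), 0.7),
                        (Terminal("turn_tap"), 0.3)]),
    Terminal("drink"),
])

if __name__ == "__main__":
    actions, prob = sample(get_water)
    print(actions, prob)

Event parsing in the paper runs in the opposite direction, inferring the most probable interpretation of an observed action sequence, but the same node types and branching probabilities determine the probability assigned to each candidate parse.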