This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The high resolution permits digitally zooming in to examine fine details, or zooming out to coarser scales, as needed for recognition. The key challenge is to avoid running a multitude of detectors at all spatiotemporal scales while still arriving at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors (the α process); 2) bottom-up inference based on detecting activity parts (the β process); and 3) top-down inference based on detecting activity context (the γ process). The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus.
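The cost-sensitive scheduling of the α, β, and γ processes can be caricatured as a greedy explore-exploit loop: each candidate process has an estimated gain in log-posterior and a computational cost, and the scheduler repeatedly runs the candidate with the best gain-to-cost ratio within a compute budget. The sketch below is a hypothetical simplification for illustration; the names (`schedule`, the `gain`/`cost` fields) and the greedy budgeted policy are assumptions, not the paper's actual algorithm, which re-estimates posteriors on parse graphs at each iteration.

```python
import heapq

def schedule(steps, budget):
    """Greedily run inference processes within a compute budget.

    steps: list of dicts with 'kind' in {'alpha', 'beta', 'gamma'},
           'gain' (estimated log-posterior increase), and 'cost'
           (estimated compute cost). All fields are illustrative.
    Returns the accumulated log-posterior gain and the order of
    executed process kinds.
    """
    # Max-heap keyed by gain per unit cost (negated for heapq's min-heap).
    heap = [(-s["gain"] / s["cost"], i) for i, s in enumerate(steps)]
    heapq.heapify(heap)
    log_posterior, executed = 0.0, []
    while heap and budget > 0:
        _, i = heapq.heappop(heap)
        s = steps[i]
        if s["cost"] > budget:
            continue  # too expensive for the remaining budget; skip it
        budget -= s["cost"]
        log_posterior += s["gain"]
        executed.append(s["kind"])
    return log_posterior, executed

# Example: a cheap bottom-up (beta) step and a cheap top-down (gamma)
# step are preferred over one expensive direct detector (alpha) run.
steps = [
    {"kind": "alpha", "gain": 2.0, "cost": 4.0},
    {"kind": "beta",  "gain": 1.5, "cost": 1.0},
    {"kind": "gamma", "gain": 0.8, "cost": 1.0},
]
print(schedule(steps, budget=2.0))  # runs beta, then gamma
```

In the paper's full formulation the gains would come from the posterior of the evolving parse graph rather than fixed per-step estimates, so the ranking changes after every executed process; the fixed-gain version above only conveys the gain-versus-cost trade-off.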