A chains model for localizing participants of group activities in videos

  • Authors:
  • Mohamed R. Amer; Sinisa Todorovic

  • Affiliations:
  • Oregon State University, USA; Oregon State University, USA

  • Venue:
  • ICCV '11: Proceedings of the 2011 International Conference on Computer Vision
  • Year:
  • 2011

Abstract

Given a video, we would like to recognize group activities, localize the video parts where these activities occur, and detect the actors involved in them. This advances prior work, which typically focuses only on video classification. We make a number of contributions. First, we specify a new mid-level video feature aimed at summarizing local visual cues into bags of the right detections (BORDs). BORDs seek to identify, among many noisy people detections, the right people who participate in a target group activity. Second, we formulate a new generative chains model of group activities. Inference of the chains model identifies a subset of BORDs in the video that belong to occurrences of the activity, and organizes them into an ensemble of temporal chains. The chains extend over, and thus localize, the time intervals occupied by the activity. We formulate a new MAP inference algorithm that iterates two steps: (i) warp the chains of BORDs in space and time to their expected locations, so that the transformed BORDs can better summarize local visual cues; and (ii) maximize the posterior probability of the chains. We outperform the state of the art on the benchmark UT-Human Interaction and Collective Activities datasets, under reasonable running times.
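
The abstract only sketches the two-step iterative inference, so the following is a minimal toy illustration, not the authors' algorithm. It alternates (i) warping a chain of detections toward expected spatio-temporal locations and (ii) greedily improving the chain's posterior score, on synthetic data. The linear-trajectory "activity model", the scoring terms, and all function names here are assumptions made purely for illustration; the paper's learned chains model and BORD features are far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for BORDs: each detection is (t, x, y, score), where score is
# the (hypothetical) confidence that the detection shows a participant.
detections = np.column_stack([
    np.sort(rng.uniform(0, 100, 50)),  # frame index t
    rng.uniform(0, 640, 50),           # x position
    rng.uniform(0, 480, 50),           # y position
    rng.uniform(0, 1, 50),             # detection confidence
])

def expected_locations(chain, t):
    """Hypothetical activity model: a linear spatio-temporal trajectory
    fit to the chain's current detections by least squares in t."""
    A = np.column_stack([chain[:, 0], np.ones(len(chain))])
    coef, *_ = np.linalg.lstsq(A, chain[:, 1:3], rcond=None)
    return np.column_stack([t, np.ones(len(t))]) @ coef

def log_posterior(chain):
    """Hypothetical chain score: detection confidences minus the squared
    deviation of detections from their expected trajectory."""
    pred = expected_locations(chain, chain[:, 0])
    residual = np.sum((chain[:, 1:3] - pred) ** 2, axis=1)
    return np.sum(np.log(chain[:, 3] + 1e-9) - 1e-4 * residual)

def map_inference(detections, chain_len=10, n_iters=20):
    # Initialize the chain with the highest-confidence detections.
    chain = detections[np.argsort(-detections[:, 3])[:chain_len]]
    chain = chain[np.argsort(chain[:, 0])]
    best = log_posterior(chain)
    for _ in range(n_iters):
        # Step (i): warp -- re-estimate the chain's expected locations.
        pred = expected_locations(chain, chain[:, 0])
        # Step (ii): maximize the posterior -- for each slot, try the
        # detection nearest its expected location (greedy reassignment),
        # accepting the swap only if the chain's score improves.
        for k in range(chain_len):
            d2 = np.sum((detections[:, 1:3] - pred[k]) ** 2, axis=1)
            cand = detections[np.argmin(d2 - np.log(detections[:, 3] + 1e-9))]
            trial = chain.copy()
            trial[k] = cand
            score = log_posterior(trial)
            if score > best:
                chain, best = trial, score
    return chain[np.argsort(chain[:, 0])], best

chain, score = map_inference(detections)
print(f"log-posterior of recovered chain: {score:.2f}")
```

The coordinate-ascent structure is the point of the sketch: each pass re-fits the expected trajectory to the current chain, then reassigns detections so the posterior can only increase, mirroring the warp/maximize alternation described above.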