Human context: modeling human-human interactions for monocular 3D pose estimation
AMDO'12 Proceedings of the 7th International Conference on Articulated Motion and Deformable Objects
Group tracking: exploring mutual relations for multiple object tracking
ECCV'12 Proceedings of the 12th European Conference on Computer Vision - Volume Part III
Using linking features in learning non-parametric part models
ECCV'12 Proceedings of the 12th European Conference on Computer Vision - Volume Part III
Appearance sharing for collective human pose estimation
ACCV'12 Proceedings of the 11th Asian Conference on Computer Vision - Volume Part I
A review of motion analysis methods for human Nonverbal Communication Computing
Image and Vision Computing
We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model to a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.
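As a rough illustration of the max-marginal combination idea described in the abstract, the sketch below (not the authors' implementation; array sizes and the `combine_max_marginals` helper are hypothetical) assumes each tree-structured submodel has already produced, via exact inference, a table of max-marginal scores for every joint at every candidate position. The tables are summed across submodels and each joint is decoded independently:

```python
import numpy as np

# Hypothetical setup: M submodels, each scoring every joint at every
# candidate image position. In the paper each submodel tracks one joint
# through time while modeling all joints per frame; here we only mimic
# the final combination step with random score tables.
rng = np.random.default_rng(0)
M, num_joints, num_positions = 3, 6, 50
max_marginals = rng.normal(size=(M, num_joints, num_positions))

def combine_max_marginals(mm):
    """Sum max-marginal scores over submodels, then pick the
    highest-scoring candidate position independently per joint."""
    combined = mm.sum(axis=0)        # shape: (num_joints, num_positions)
    return combined.argmax(axis=1)   # one position index per joint

best_positions = combine_max_marginals(max_marginals)
print(best_positions.shape)  # one decoded position index per joint
```

Because each submodel is a tree, its max-marginals come from exact dynamic programming, so this combination is cheap compared to running dual-decomposition iterations over a loopy joint model.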