Tracking people and recognizing their activities

  • Authors:
  • David Forsyth; Deva Kannan Ramanan

  • Affiliations:
  • University of California, Berkeley; University of California, Berkeley

  • Venue:
  • Tracking people and recognizing their activities
  • Year:
  • 2005

Abstract

An important, open vision problem is to automatically describe what people are doing in a video sequence. This problem is difficult for several reasons. First, one needs to determine how many people (if any) are in each frame and estimate where they are and what their arms and legs are doing. Finding people and localizing their limbs is hard because people (a) move quickly and unpredictably, (b) wear a wide variety of clothes, and (c) appear in a wide variety of poses. Second, one must describe what each person is doing; this problem is poorly understood, not least because there is no known natural or canonical set of categories into which to classify activities.

This thesis addresses a number of key issues needed to build a working system. First, we develop a completely automatic person tracker that accurately tracks torsos, arms, legs, and heads. Our system works in two stages: it (a) builds a model of the appearance of each person in a video and then (b) tracks by detecting those models in each frame ("tracking by model-building and detection"). By looking for coherence across a video, our system can also build models of unknown objects. We use it to build articulated models of various animals; these models can then be used to detect the animals in new images. In this way, our tracking algorithm can be viewed as a system that builds models for object detection.

We then marry our tracker with a motion synthesis engine that works by re-assembling pre-recorded motion clips. The synthesis engine generates new motions that are human-like and close to the image measurements reported by the tracker. By using labeled motion clips, our synthesizer also generates activity labels for each image frame ("analysis by synthesis").
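The two-stage "tracking by model-building and detection" idea can be illustrated with a minimal sketch. This is not the thesis's implementation; all names are hypothetical, and scalar features stand in for the appearance models and limb detectors the abstract describes:

```python
# Hedged sketch of "tracking by model-building and detection":
# stage 1 builds an appearance model of a person from a video;
# stage 2 re-detects that model in every frame.
# Scalars stand in for real appearance features; all names are illustrative.

def build_appearance_model(frames):
    """Stage 1: summarize appearance over frames where the person
    was found (here, the mean of the available feature values)."""
    samples = [f for f in frames if f is not None]  # None = no detection
    return sum(samples) / len(samples)

def detect(model, frame, tol=1.0):
    """Stage 2: declare a detection when the frame matches the model."""
    return frame is not None and abs(frame - model) <= tol

def track(frames):
    model = build_appearance_model(frames)
    return [detect(model, f) for f in frames]

# toy "video": numbers are per-frame appearance features
video = [10.2, 10.0, None, 9.8, 10.1]
print(track(video))  # → [True, True, False, True, True]
```

The point of the two-stage structure is that the model is specific to the person in this video, so detection in stage 2 can be much more reliable than a generic person detector applied frame by frame.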
We have extensively tested our system, running it on hundreds of thousands of frames of unscripted indoor and outdoor activity, a feature-length film ('Run Lola Run'), and legacy sports footage (from the 2002 World Series and 1998 Winter Olympics).
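The "analysis by synthesis" step described above can also be sketched in miniature: each frame's tracker measurement is matched to the closest pre-recorded, labeled motion clip, and the frame inherits that clip's activity label. The clip library, features, and distance below are all hypothetical placeholders, not the thesis's actual representation:

```python
# Hedged sketch of "analysis by synthesis": label each frame by the
# nearest labeled motion clip. 1-D pose features and the clip library
# are illustrative stand-ins for real motion-capture data.

LABELED_CLIPS = [
    (0.1, "stand"),  # (pose feature, activity label)
    (0.9, "walk"),
    (1.8, "run"),
]

def label_frames(measurements):
    """Assign each tracker measurement the label of its nearest clip."""
    labels = []
    for m in measurements:
        _, label = min(LABELED_CLIPS, key=lambda clip: abs(clip[0] - m))
        labels.append(label)
    return labels

print(label_frames([0.2, 1.0, 2.0]))  # → ['stand', 'walk', 'run']
```

Because the synthesized motion is assembled from labeled clips, the activity labels come "for free" with the synthesis, which is the appeal of the approach.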