Semantic-level Understanding of Human Actions and Interactions using Event Hierarchy

  • Authors:
  • Sangho Park; J. K. Aggarwal

  • Affiliations:
  • The University of Texas at Austin; The University of Texas at Austin

  • Venue:
  • CVPRW '04 Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04), Volume 1
  • Year:
  • 2004


Abstract

Understanding human behavior in video data is essential in numerous applications, including surveillance, video annotation/retrieval, and human-computer interfaces. This paper describes a framework for recognizing human actions and interactions in video using three levels of abstraction. At the low level, the poses of individual body parts, including the head, torso, arms, and legs, are recognized using individual Bayesian networks (BNs), which are then integrated to obtain an overall body pose. At the mid level, the actions of a single person are modeled using a dynamic Bayesian network (DBN) with temporal links between identical states of the Bayesian network at times t and t+1. At the high level, the mid-level descriptions for each person are juxtaposed along a common time line to identify an interaction between the two persons. The linguistic 'verb argument structure' is used to represent human action in terms of triplets, and spatial and temporal constraints drive a decision tree that recognizes specific interactions, yielding a meaningful semantic description in terms of subject-verb-object. Our method provides a user-friendly natural-language description of several human interactions and correctly describes positive, neutral, and negative interactions occurring between two persons. Example sequences of real persons are presented to illustrate the paradigm.
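To make the high-level stage concrete, the following is a minimal illustrative sketch (not the paper's implementation) of representing mid-level action descriptions as subject-verb-object triplets and classifying a two-person interaction as positive, neutral, or negative with hand-written spatial and temporal constraints, in the spirit of the decision-tree stage. All class names, verbs, and thresholds here are assumptions for illustration.

```python
# Illustrative sketch of triplet-based interaction classification.
# Names, verbs, and thresholds are hypothetical, not from the paper.
from dataclasses import dataclass


@dataclass
class ActionTriplet:
    subject: str   # agent identifier, e.g. "person1"
    verb: str      # recognized mid-level action, e.g. "stretch-arm"
    obj: str       # target of the action, e.g. "person2"
    t: int         # frame index on the common time line


def classify_interaction(a: ActionTriplet, b: ActionTriplet,
                         distance: float) -> str:
    """Toy decision tree: label the interaction between two persons as
    'positive', 'neutral', or 'negative' from two co-occurring triplets."""
    # Temporal constraint: the two actions must overlap in time
    # (here: within 5 frames of each other).
    if abs(a.t - b.t) > 5:
        return "neutral"
    # Spatial constraint: an interaction requires proximity
    # (units and threshold are arbitrary for this sketch).
    if distance > 2.0:
        return "neutral"
    # Verb-pair rules, e.g. mutual arm-stretching read as a handshake.
    if a.verb == "stretch-arm" and b.verb == "stretch-arm":
        return "positive"
    if "push" in (a.verb, b.verb):
        return "negative"
    return "neutral"


p1 = ActionTriplet("person1", "stretch-arm", "person2", t=10)
p2 = ActionTriplet("person2", "stretch-arm", "person1", t=12)
print(classify_interaction(p1, p2, distance=1.0))  # prints "positive"
```

The sketch also shows how the subject-verb-object form maps directly to a natural-language description: the triplets above read as "person1 stretches an arm toward person2" and vice versa, which the rule labels a positive interaction.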