Mining spatiotemporal video patterns towards robust action retrieval

  • Authors:
  • Liujuan Cao; Rongrong Ji; Yue Gao; Wei Liu; Qi Tian

  • Affiliations:
  • Harbin Engineering University, Harbin 150001, China; Columbia University, New York City 10027, United States; Department of Automation, Tsinghua University, 100086, China; Columbia University, New York City 10027, United States; University of Texas at San Antonio, San Antonio 78249-1644, United States

  • Venue:
  • Neurocomputing
  • Year:
  • 2013

Abstract

In this paper, we present a spatiotemporal co-location video pattern mining approach with application to robust action retrieval in YouTube videos. First, we introduce an attention shift scheme to detect and partition the focused human actions in YouTube videos, based on visual saliency modeling [13] together with face [35] and body [32] detectors. From the segmented spatiotemporal human action regions, we extract interest points with the 3D-SIFT detector [17]. Then, we quantize all interest points detected in the reference YouTube videos into a visual vocabulary, based on which each interest point is assigned a word identity. An Apriori-based frequent itemset mining scheme is then applied to the spatiotemporally co-located words to discover co-location video patterns. Finally, we fuse visual words and patterns and apply a boosting-based feature selection, which incorporates the ranking distortion of conjunctive queries into the boosting objective, to output the final action descriptors. We carried out quantitative evaluations on the KTH human motion benchmark [26] and on 60 hours of YouTube videos, with comparisons to state-of-the-art methods.
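
To make the pattern-mining step concrete, the sketch below shows a minimal Apriori-style frequent itemset miner over co-located visual words. It is not the authors' implementation: it assumes each "transaction" is the set of quantized word IDs falling inside one spatiotemporal neighborhood of a segmented action region, and the `min_support` threshold and toy transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets of visual word IDs) with support counts.

    transactions: list of sets, each holding the word IDs observed in one
    spatiotemporal co-location neighborhood (assumed input representation).
    """
    def count_support(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:          # candidate itemset is contained in the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single words
    items = {frozenset([w]) for t in transactions for w in t}
    frequent = count_support(items)
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        prev = list(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for i, a in enumerate(prev) for b in prev[i + 1:] if len(a | b) == k}
        # Prune any candidate with an infrequent (k-1)-subset (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count_support(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Toy usage with hypothetical co-located word-ID sets from one video clip
transactions = [{3, 7, 12}, {3, 7, 25}, {3, 12, 25}, {7, 12, 25}, {3, 7, 12, 25}]
patterns = apriori(transactions, min_support=3)
print(sorted(patterns.items(), key=lambda kv: (-len(kv[0]), -kv[1])))
```

In this toy run, every word pair reaches the support threshold and becomes a co-location pattern, while no triple does; in the paper's pipeline the mined patterns would then be fused with the individual visual words before boosting-based feature selection.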