Joint segmentation and classification of human actions in video

Authors:
Minh Hoai; Zhen-Zhong Lan; F. De la Torre
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA, USA; Carnegie Mellon Univ., Pittsburgh, PA, USA; Carnegie Mellon Univ., Pittsburgh, PA, USA
Venue:
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Year:
2011


Abstract

Automatic video segmentation and action recognition have been long-standing problems in computer vision. Much work in the literature treats video segmentation and action recognition as two independent problems: segmentation is often done without a temporal model of the activity, while action recognition is usually performed on pre-segmented clips. In this paper we propose a novel method that avoids the limitations of these approaches by jointly performing video segmentation and action recognition. Unlike standard approaches based on extensions of dynamic Bayesian networks, our method is based on a discriminative temporal extension of the spatial bag-of-words model that has been very popular in object recognition. Classification is performed robustly within a multi-class SVM framework, while inference over the segments is done efficiently with dynamic programming. Experimental results on the honeybee, Weizmann, and Hollywood datasets illustrate the benefits of our approach compared to state-of-the-art methods.
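
To illustrate how dynamic programming can search over segmentations and class labels jointly, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each frame has already been quantized to a codeword index, that a segment is described by its unnormalized bag-of-words histogram, and that each class scores a segment with a linear function (as a trained multi-class linear SVM would provide). The function name `segment_and_classify`, the parameters `W`, `b`, `min_len`, and `max_len`, and the plain sum-of-segment-scores objective are all hypothetical simplifications; the abstract does not specify the paper's exact objective.

```python
import numpy as np

def segment_and_classify(codewords, W, b, min_len=1, max_len=None):
    """Jointly segment and label a sequence by dynamic programming.

    codewords: length-T sequence of per-frame codeword indices (assumed input).
    W: (C, K) array of per-class linear weights over K codewords (assumed).
    b: length-C array of per-class biases, added once per segment (assumed).
    Returns (segments, score), where segments is a list of
    (start, end, class) tuples with end exclusive.
    """
    codewords = np.asarray(codewords)
    T = len(codewords)
    C, _ = W.shape
    max_len = T if max_len is None else max_len

    # prefix[t, c] = sum over frames i < t of W[c, codewords[i]], so the
    # linear score of a histogram over [s, t) is prefix[t] - prefix[s].
    prefix = np.zeros((T + 1, C))
    prefix[1:] = np.cumsum(W[:, codewords].T, axis=0)

    best = np.full(T + 1, -np.inf)  # best[t]: best total score for frames [0, t)
    best[0] = 0.0
    back = [None] * (T + 1)         # back[t]: (start, class) of the last segment

    for t in range(1, T + 1):
        for length in range(min_len, min(max_len, t) + 1):
            s = t - length
            if best[s] == -np.inf:
                continue  # no valid segmentation ends at s
            seg_scores = prefix[t] - prefix[s] + b
            c = int(np.argmax(seg_scores))
            total = best[s] + seg_scores[c]
            if total > best[t]:
                best[t] = total
                back[t] = (s, c)

    if back[T] is None:
        raise ValueError("no segmentation satisfies the length constraints")

    # Walk the back-pointers from the end to recover the segments.
    segments = []
    t = T
    while t > 0:
        s, c = back[t]
        segments.append((s, t, c))
        t = s
    return segments[::-1], best[T]
```

For example, with two codewords, two classes, identity weights, and a bias of -0.5 per segment, the sequence `[0, 0, 0, 1, 1, 1]` is split at frame 3 into one segment of each class. The negative bias acts as a per-segment penalty that discourages over-segmentation; the running time is O(T · max_len · C).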