Learning semantic features for action recognition via diffusion maps

Authors:
Jingen Liu;Yang Yang;Imran Saleemi;Mubarak Shah
Affiliations:
Department of EECS, University of Michigan, Ann Arbor, MI, USA;Department of EECS, University of Central Florida, Orlando, FL, USA;Department of EECS, University of Central Florida, Orlando, FL, USA;Department of EECS, University of Central Florida, Orlando, FL, USA
Venue:
Computer Vision and Image Understanding
Year:
2012

Abstract

Efficient modeling of actions is critical for recognizing human actions. Recently, the bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation, however, is highly sensitive to two main factors: the granularity, and therefore the size, of the vocabulary, and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both of these limitations. We present a principled approach to learning a semantic vocabulary from a large number of video words using Diffusion Maps embedding. As opposed to the flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) values between that video word and the training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and to retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space, so as to construct a compact yet discriminative high-level vocabulary.
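As a rough illustration of the mid-level representation described above, the following sketch computes one PMI vector per video word from a word-by-clip count matrix. The function name `pmi_features`, the use of raw counts to estimate probabilities, and the choice to map zero co-occurrences to 0 are assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np

def pmi_features(counts):
    """Return a (n_words, n_clips) PMI matrix from co-occurrence counts.

    counts[i, j] is how often video word i occurs in training clip j
    (hypothetical input layout, assumed for this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()                # p(word, clip)
    p_word = p_joint.sum(axis=1, keepdims=True)    # p(word)
    p_clip = p_joint.sum(axis=0, keepdims=True)    # p(clip)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_word * p_clip))
    # Assumption: pairs that never co-occur get PMI 0 instead of -inf.
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi

# Each row is the mid-level feature of one video word.
example = pmi_features([[2, 0], [0, 2]])
```

Row `i` of the result is the mid-level feature vector of video word `i`; these rows are the points that would subsequently be embedded and clustered.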
Unlike some supervised vocabulary construction approaches and unsupervised methods such as pLSA and LDA, Diffusion Maps can capture the local relationships between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained promising results.
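The embedding step can be sketched with a standard diffusion map construction. This is a minimal version under stated assumptions, not the paper's implementation: the Gaussian kernel, the bandwidth `sigma`, the diffusion time `t`, and the function name `diffusion_map` are all choices made for this example. Euclidean distance between rows of the output approximates the diffusion distance between the corresponding input points.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, n_components=2, t=1):
    """Embed the rows of X using the leading nontrivial diffusion coordinates."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian affinities (assumed kernel)
    d = W.sum(axis=1)                        # node degrees
    # Symmetric matrix sharing eigenvalues with the Markov matrix P = D^-1 W.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]           # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    # Right eigenvectors of P, normalized w.r.t. the stationary distribution.
    psi = vecs / np.sqrt(d)[:, None] * np.sqrt(d.sum())
    # Drop the trivial constant eigenvector (eigenvalue 1).
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t

# Two small, well-separated groups of points (toy data for illustration).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
embedding = diffusion_map(points, sigma=1.0, n_components=1)
```

In a pipeline like the one outlined in the abstract, the rows of the input would be the PMI vectors of the video words, and the embedded points would then be grouped by a clustering method such as k-means to form the high-level vocabulary.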