Scene aligned pooling for complex video recognition

  • Authors:
  • Liangliang Cao; Yadong Mu; Apostol Natsev; Shih-Fu Chang; Gang Hua; John R. Smith

  • Affiliations:
  • IBM T. J. Watson Research Center; Dept. Electrical Engineering, Columbia University; Google Research; Dept. Electrical Engineering, Columbia University; Dept. Computer Science, Stevens Institute of Technology; IBM T. J. Watson Research Center

  • Venue:
  • ECCV'12: Proceedings of the 12th European Conference on Computer Vision, Part II
  • Year:
  • 2012

Abstract

Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots from different scenes, the key idea of scene aligned pooling is to decompose video features into concurrent scene components and to construct classification models adapted to each scene. Experiments on two large-scale real-world datasets, TRECVID Multimedia Event Detection 2011 and the Human Motion Database (HMDB), show that the new visual representation consistently improves, by a significant margin, a wide range of visual features: low-level color and texture features, mid-level histograms of local descriptors such as SIFT and space-time interest points, and high-level semantic model features. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.
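
The abstract describes the pooling idea only at a high level; the sketch below is a minimal illustration of it under assumptions not stated in the paper: per-frame descriptors are available, each frame carries a soft scene-assignment probability (e.g. from a separate scene clustering or detection step), and one classifier is trained per scene component. The function names (`scene_aligned_pooling`, `score_video`) and the simple average fusion are hypothetical choices for illustration, not the authors' exact formulation.

```python
import numpy as np


def scene_aligned_pooling(frame_features, scene_probs):
    """Pool frame-level features into per-scene components.

    frame_features: (T, D) array of per-frame descriptors (e.g. SIFT histograms).
    scene_probs:    (T, K) array of soft scene-assignment probabilities per frame.
    Returns a (K, D) array: one pooled feature vector per scene component.
    """
    # Weighted-average pooling: each scene component aggregates the frames
    # likely to belong to it, normalized by the total per-scene weight.
    weights = scene_probs / (scene_probs.sum(axis=0, keepdims=True) + 1e-8)  # (T, K)
    return weights.T @ frame_features  # (K, D)


def score_video(pooled, scene_classifiers):
    """Fuse per-scene classifier scores into a single video-level event score.

    scene_classifiers: list of K callables, each mapping a D-dim vector to a score.
    """
    scores = [clf(x) for clf, x in zip(scene_classifiers, pooled)]
    return float(np.mean(scores))  # simple average fusion (an assumption)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, K = 120, 64, 4                       # frames, feature dim, scene components
    feats = rng.random((T, D))                 # stand-in frame descriptors
    probs = rng.dirichlet(np.ones(K), size=T)  # stand-in soft scene assignments
    pooled = scene_aligned_pooling(feats, probs)
    # Toy linear scorers standing in for per-scene event classifiers (e.g. SVMs).
    clfs = [lambda x, w=rng.standard_normal(D): float(w @ x) for _ in range(K)]
    print(pooled.shape, score_video(pooled, clfs))
```

In this reading, the per-scene pooled vectors let each classifier specialize to the statistics of one scene type, which is the intuition the abstract gives for why the representation helps across low-, mid-, and high-level features.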