Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, scene aligned pooling, for event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose any video feature into concurrent scene components and to construct classification models that adapt to the different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 dataset and the Human Motion Recognition Database (HMDB), show that the new representation consistently improves a wide range of visual features by a significant margin: low-level color and texture features, mid-level histograms of local descriptors such as SIFT or space-time interest points, and high-level semantic model features. For example, we improve on the state-of-the-art accuracy on the HMDB dataset by 20%.
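One way to read the key idea is as per-scene pooling followed by concatenation. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes every frame already has a feature vector and a hard scene label (for instance from clustering a holistic scene descriptor), and the function name `scene_aligned_pool`, the use of average pooling, and the zero-filling of absent scenes are all hypothetical choices.

```python
from typing import List, Sequence


def scene_aligned_pool(
    frame_features: Sequence[Sequence[float]],
    scene_ids: Sequence[int],
    num_scenes: int,
) -> List[float]:
    """Average-pool frame features separately per scene, then concatenate.

    Illustrative only: hard scene assignment and average pooling are
    assumptions made for this sketch, not details from the paper.
    """
    if len(frame_features) != len(scene_ids):
        raise ValueError("need exactly one scene id per frame")

    dim = len(frame_features[0]) if frame_features else 0
    sums = [[0.0] * dim for _ in range(num_scenes)]
    counts = [0] * num_scenes

    # Accumulate each frame's features into the block of its scene.
    for feat, sid in zip(frame_features, scene_ids):
        counts[sid] += 1
        for j, value in enumerate(feat):
            sums[sid][j] += value

    pooled: List[float] = []
    for sid in range(num_scenes):
        n = counts[sid]
        # Scenes that never occur in the clip contribute an all-zero block.
        pooled.extend(total / n if n else 0.0 for total in sums[sid])
    return pooled


# Three frames with 2-D features; the first two belong to scene 0,
# the third to scene 1, and scene 2 does not occur in this clip.
clip = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vector = scene_aligned_pool(clip, scene_ids=[0, 0, 1], num_scenes=3)
```

Because the output has one block per scene, a linear classifier trained on it learns a separate weight vector for each scene, which is one simple way to obtain a model that adapts to scene content.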