A Model of Saliency-Based Visual Attention for Rapid Scene Analysis
IEEE Transactions on Pattern Analysis and Machine Intelligence
Multiple kernel learning, conic duality, and the SMO algorithm
ICML '04 Proceedings of the Twenty-First International Conference on Machine Learning
Recognizing Human Actions: A Local SVM Approach
ICPR '04 Proceedings of the 17th International Conference on Pattern Recognition, Volume 3
Large Scale Multiple Kernel Learning
The Journal of Machine Learning Research
Behavior recognition via sparse spatio-temporal features
VS-PETS '05 Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance
Robust Object Recognition with Cortex-Like Mechanisms
IEEE Transactions on Pattern Analysis and Machine Intelligence
Is bottom-up attention useful for object recognition?
CVPR '04 Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Learning to Detect a Salient Object
IEEE Transactions on Pattern Analysis and Machine Intelligence
Action recognition by dense trajectories
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Intrinsic Dimensionality Predicts the Saliency of Natural Dynamic Scenes
IEEE Transactions on Pattern Analysis and Machine Intelligence
Dynamic eye movement datasets and learnt saliency models for visual action recognition
ECCV '12 Proceedings of the 12th European Conference on Computer Vision, Part II
Activity representation with motion hierarchies
International Journal of Computer Vision
Algorithms using "bag of features"-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions; descriptors from these regions are then either used exclusively or given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms [1,2,3] and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded while maintaining high performance on Hollywood2, and pruning of 20-50% (depending on the model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on saliency-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [1] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
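The pruning step described above amounts to thresholding descriptors by the saliency value at their spatial location, with the threshold chosen so that a target fraction of descriptors survives. A minimal sketch in numpy (the function name, array layout, and quantile-based threshold are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def prune_descriptors(descriptors, positions, saliency, keep_fraction=0.5):
    """Retain only descriptors that fall in salient regions.

    descriptors:   (N, D) local spatiotemporal descriptors
    positions:     (N, 2) integer (row, col) location of each descriptor
    saliency:      (H, W) saliency map; higher values = more salient
    keep_fraction: fraction of descriptors to retain
                   (e.g. 0.3 discards 70%, as in the abstract's upper limit)
    """
    # Saliency value at each descriptor's location.
    scores = saliency[positions[:, 0], positions[:, 1]]
    # Threshold so that roughly `keep_fraction` of descriptors survive.
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    mask = scores >= cutoff
    return descriptors[mask], mask
```

The surviving descriptors would then be quantized against the codebook and pooled into the bag-of-features histogram as usual; the weighted variant would instead assign salient-region descriptors extra codebook vectors rather than discarding the rest.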