Space-variant descriptor sampling for action recognition based on saliency and eye movements

  • Authors:
  • Eleonora Vig; Michael Dorr; David Cox

  • Affiliations:
  • The Rowland Institute at Harvard, Cambridge, MA; Schepens Eye Research Institute, Harvard Medical School, Boston, MA; The Rowland Institute at Harvard, Cambridge, MA

  • Venue:
  • ECCV'12: Proceedings of the 12th European Conference on Computer Vision, Part VII
  • Year:
  • 2012


Abstract

Algorithms using "bag of features"-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions; descriptors corresponding to these regions are either used exclusively or are given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms [1,2,3] and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded while maintaining high performance on Hollywood2; moderate pruning of 20-50% (depending on the model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on saliency-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [1] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
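
The central operation described in the abstract, discarding the least-salient spatiotemporal descriptors before bag-of-features encoding, can be illustrated with a short sketch. The code below is not the authors' implementation: the function names, the quantile-based pruning threshold, and the toy NumPy data stand in for the actual descriptor extraction, saliency models, and codebook learning used in the paper, and are assumptions made purely for illustration.

    # Minimal sketch (assumed, not the authors' code) of saliency-based descriptor
    # pruning followed by bag-of-features encoding.
    import numpy as np

    def prune_descriptors(descriptors, positions, saliency_volume, keep_fraction=0.5):
        """Keep only descriptors whose (x, y, t) location falls on sufficiently
        salient voxels of a per-video saliency volume of shape (T, H, W)."""
        x, y, t = positions[:, 0], positions[:, 1], positions[:, 2]
        scores = saliency_volume[t, y, x]
        # Threshold at the (1 - keep_fraction) quantile, i.e. discard the
        # least-salient descriptors (the paper reports pruning up to ~70%).
        threshold = np.quantile(scores, 1.0 - keep_fraction)
        return descriptors[scores >= threshold]

    def bag_of_features(descriptors, codebook):
        """Hard-assign each descriptor to its nearest codebook vector and
        return an L1-normalized histogram of codeword counts."""
        dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
        return hist / max(hist.sum(), 1.0)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        T, H, W, D, K = 50, 120, 160, 96, 32            # toy video / descriptor sizes
        descriptors = rng.normal(size=(1000, D))         # e.g. HOG/HOF/MBH descriptors
        positions = np.column_stack([rng.integers(0, W, 1000),
                                     rng.integers(0, H, 1000),
                                     rng.integers(0, T, 1000)])
        saliency = rng.random((T, H, W))                 # stand-in saliency volume
        codebook = rng.normal(size=(K, D))               # stand-in k-means codebook

        pruned = prune_descriptors(descriptors, positions, saliency, keep_fraction=0.5)
        print("kept", len(pruned), "of", len(descriptors), "descriptors")
        print("histogram:", bag_of_features(pruned, codebook))

In this sketch, the pruned and unpruned descriptor sets could each be encoded and classified separately and their representations combined, mirroring the combination strategy the abstract describes; the eye-movement-derived saliency maps would simply replace the stand-in saliency volume.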