Systems based on bag-of-words models operating on image features collected at the maxima of sparse interest point operators have been extremely successful for both computer-based visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets like Hollywood-2[1] and UCF Sports[2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge, these are the first large-scale human eye-tracking datasets to be collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the massive amounts of collected data to conduct studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance. They also demonstrate that human fixations can be accurately predicted and, when used within an end-to-end automatic system that leverages some of the most advanced computer vision practice, can lead to state-of-the-art results.
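The inter-subject consistency analysis mentioned above can be illustrated with a common evaluation scheme from the saliency literature: hold out one subject, build a fixation-density map from the remaining subjects, and measure (via ROC AUC) how well that map separates the held-out subject's fixation locations from random image locations. This is a minimal sketch of that generic protocol, not the authors' dynamic consistency model; the grid size, helper names, and scoring choices are illustrative assumptions.

```python
import numpy as np

def density_map(fixations, shape):
    """Accumulate fixation locations into a coarse spatial histogram.
    `fixations` is a list of (row, col) grid cells (hypothetical format)."""
    grid = np.zeros(shape)
    for y, x in fixations:
        grid[y, x] += 1.0
    return grid

def auc(pos_scores, neg_scores):
    """ROC AUC via pairwise comparison: P(pos > neg) + 0.5 * P(tie)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def inter_subject_consistency(all_fixations, shape, rng):
    """Leave-one-subject-out agreement: score each subject's fixations
    under the density map of the remaining subjects, against uniformly
    random locations. ~0.5 means no agreement; near 1.0 means subjects
    fixate the same regions."""
    scores = []
    for i, held_out in enumerate(all_fixations):
        others = [f for j, f in enumerate(all_fixations) if j != i]
        m = sum(density_map(f, shape) for f in others)
        pos = [m[y, x] for y, x in held_out]
        neg = [m[rng.integers(shape[0]), rng.integers(shape[1])]
               for _ in held_out]
        scores.append(auc(pos, neg))
    return float(np.mean(scores))
```

With task-controlled viewing, such scores tend to be well above chance, which is the kind of stability the abstract's consistency models quantify in the dynamic, per-frame setting.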