Systems based on bag-of-words models operating on image features collected at the maxima of sparse interest point operators have been extremely successful for both computer-based visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets like Hollywood-2[1] and UCF Sports[2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge, these are the first large-scale human eye-tracking datasets to be collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the massive amounts of collected data to conduct studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance. They also demonstrate that human fixations can be accurately predicted and, when used within an end-to-end automatic system that leverages some of the most advanced computer vision practice, can lead to state-of-the-art results.
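The inter-subject consistency analysis mentioned above can be illustrated with a common evaluation scheme from the saliency literature: hold out one subject, build a fixation-density map from the remaining subjects, and measure (via ROC AUC) how well that map separates the held-out subject's fixation locations from random image locations. This is a minimal sketch of that generic protocol, not the authors' dynamic consistency model; the grid size, helper names, and scoring choices are illustrative assumptions.

```python
import numpy as np

def density_map(fixations, shape):
    """Accumulate fixation locations into a coarse spatial histogram.
    `fixations` is a list of (row, col) grid cells (hypothetical format)."""
    grid = np.zeros(shape)
    for y, x in fixations:
        grid[y, x] += 1.0
    return grid

def auc(pos_scores, neg_scores):
    """ROC AUC via pairwise comparison: P(pos > neg) + 0.5 * P(tie)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def inter_subject_consistency(all_fixations, shape, rng):
    """Leave-one-subject-out agreement: score each subject's fixations
    under the density map of the remaining subjects, against uniformly
    random locations. ~0.5 means no agreement; near 1.0 means subjects
    fixate the same regions."""
    scores = []
    for i, held_out in enumerate(all_fixations):
        others = [f for j, f in enumerate(all_fixations) if j != i]
        m = sum(density_map(f, shape) for f in others)
        pos = [m[y, x] for y, x in held_out]
        neg = [m[rng.integers(shape[0]), rng.integers(shape[1])]
               for _ in held_out]
        scores.append(auc(pos, neg))
    return float(np.mean(scores))
```

With task-controlled viewing, such scores tend to be well above chance, which is the kind of stability the abstract's consistency models quantify in the dynamic, per-frame setting.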