Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements

Authors:
Ramanathan Subramanian;Victoria Yanulevskaya;Nicu Sebe
Affiliations:
University of Trento, Trento, Italy;University of Trento, Trento, Italy;University of Trento, Trento, Italy
Venue:
MM '11 Proceedings of the 19th ACM international conference on Multimedia
Year:
2011



Abstract

This paper describes an attempt to bridge the semantic gap between computer vision and scene understanding using eye movements. While computer vision algorithms can efficiently detect scene objects, discovering the semantic relationships between these objects is equally essential for scene understanding. Humans understand complex scenes by rapidly moving their eyes (saccades) to selectively focus on salient entities (fixations). For 110 social scenes, we compared verbal descriptions provided by observers against eye movements recorded during a free-viewing task. Data analysis confirms (i) a strong correlation between task-explicit linguistic descriptions and task-implicit eye movements, both of which are influenced by underlying scene semantics, and (ii) the ability of eye movements, in the form of fixations and saccades, to indicate salient entities and entity relationships mentioned in scene descriptions. We demonstrate how eye movements are useful for inferring the meaning of social scenes (everyday scenes depicting human activities) and affective scenes (emotion-evoking content such as expressive faces and nudes). While saliency has always been studied through the prism of fixations, we show that saccades are particularly useful for (i) distinguishing mild from high-intensity facial expressions and (ii) discovering interactive actions between scene entities.
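
To make concrete the idea that saccades can hint at relationships between scene entities, the following is a minimal illustrative sketch and not the authors' method. It assumes fixations are already available as an ordered list of (x, y) points and that entities are given as axis-aligned bounding boxes. The function names, box labels, and sample coordinates are all hypothetical. The sketch counts how often gaze moves directly from one labeled entity to another.

```python
from collections import Counter

def entity_at(point, boxes):
    """Return the name of the first box containing the point, or None."""
    x, y = point
    for name, (x0, y0, x1, y1) in boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def saccade_transitions(fixations, boxes):
    """Count direct gaze transitions between distinct labeled entities.

    Consecutive fixations are treated as the endpoints of a saccade.
    Transitions that start or end outside every box are ignored.
    """
    labels = [entity_at(p, boxes) for p in fixations]
    counts = Counter()
    for src, dst in zip(labels, labels[1:]):
        if src is not None and dst is not None and src != dst:
            counts[(src, dst)] += 1
    return counts

# Hypothetical data: two entities and an ordered fixation sequence.
boxes = {
    "person_a": (0, 0, 100, 200),
    "person_b": (150, 0, 250, 200),
}
fixations = [(50, 50), (200, 60), (60, 40), (300, 300), (210, 90)]

transitions = saccade_transitions(fixations, boxes)
for (src, dst), n in sorted(transitions.items()):
    print(f"{src} -> {dst}: {n}")
```

Under these assumptions, frequent back-and-forth transitions between two entities could be read as a cue that viewers perceive them as interacting. A real analysis would also need fixation detection from raw gaze samples and a principled way to aggregate across observers.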