Scene semantics from long-term observation of people

  • Authors:
  • Vincent Delaitre (INRIA/École Normale Supérieure, Paris, France); David F. Fouhey (Carnegie Mellon University); Ivan Laptev (INRIA/École Normale Supérieure, Paris, France); Josef Sivic (INRIA/École Normale Supérieure, Paris, France); Abhinav Gupta (Carnegie Mellon University); Alexei A. Efros (INRIA/École Normale Supérieure, Paris, France; Carnegie Mellon University)

  • Venue:
  • ECCV'12: Proceedings of the 12th European Conference on Computer Vision, Part VI
  • Year:
  • 2012

Abstract

Our everyday objects support various tasks and can be used by people for different purposes. While object classification is a widely studied topic in computer vision, recognition of object function, i.e., what people can do with an object and how they do it, is rarely addressed. In this paper we construct a functional object description with the aim of recognizing objects by the way people interact with them. We describe scene objects (sofas, tables, chairs) by their associated human poses together with object appearance. Our model is learned discriminatively from automatically estimated body poses in many realistic scenes. In particular, we make use of time-lapse videos from YouTube, which provide a rich source of common human-object interactions while minimizing the effort of manual object annotation. We show that models learned from such human observations significantly improve object recognition and enable prediction of characteristic human poses in new scenes. Results are reported on a dataset of more than 400,000 frames obtained from 146 time-lapse videos of challenging and realistic indoor scenes.
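To make the idea concrete, below is a minimal sketch (not the authors' implementation) of the kind of pipeline the abstract describes: pose detections accumulated over many time-lapse frames are turned into a coarse spatial "functional" descriptor for a candidate object region, concatenated with an appearance feature, and fed to a discriminative linear classifier. All helper functions, feature dimensions, and the synthetic data are illustrative assumptions.

```python
# Sketch only: pose-based functional descriptor + appearance feature
# + discriminative (linear SVM) object classifier. Shapes and data are assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def pose_descriptor(pose_responses, grid=(4, 4)):
    """Accumulate per-frame body-pose responses falling inside a candidate
    object region into a coarse spatial histogram (assumed descriptor form)."""
    hist = np.zeros(grid[0] * grid[1])
    for x, y, score in pose_responses:          # normalized coords in [0, 1)
        cell = int(y * grid[0]) * grid[1] + int(x * grid[1])
        hist[cell] += score
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def fake_region(label):
    """Stand-in for a real region: pose responses aggregated over many
    time-lapse frames plus a placeholder appearance vector (e.g., HOG-like)."""
    poses = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]
    appearance = rng.random(128)
    return np.concatenate([pose_descriptor(poses), appearance]), label

# Two hypothetical object classes (e.g., "sofa" vs "table"), 100 regions each.
X, y = zip(*[fake_region(label) for label in [0, 1] * 100])
clf = LinearSVC(C=1.0).fit(np.array(X), np.array(y))   # discriminative training
print("train accuracy:", clf.score(np.array(X), np.array(y)))
```

The key design point the paper argues for is that the pose channel carries information about function (how people sit, lean, or reach) that the appearance channel alone misses; in this sketch that corresponds to concatenating the two feature blocks before training.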