Weakly Supervised Learning of Interactions between Humans and Objects

Authors:
Alessandro Prest; Cordelia Schmid; Vittorio Ferrari
Affiliations:
ETH, Zurich, and INRIA, Grenoble; INRIA, Grenoble; ETH, Zurich
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2012

Abstract

We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set.
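
The abstract does not give the form of the probabilistic human-object model, so the following is only an illustrative sketch of one way a spatial relation between a detected human and an object could be modeled. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), describes the object by its offset and scale relative to the human box, and fits an independent (diagonal) Gaussian to those features. The function names, the feature choice, and the Gaussian form are all assumptions made for illustration, not the authors' formulation.

```python
import math
from statistics import mean, pvariance


def relative_features(human, obj):
    """Describe an object box relative to a human box.

    Both boxes are (x_min, y_min, x_max, y_max). Returns (dx, dy, log_scale):
    the object's center offset in units of human width/height, and the log of
    the object-to-human linear scale ratio. (Hypothetical feature choice.)
    """
    hx, hy = (human[0] + human[2]) / 2, (human[1] + human[3]) / 2
    hw, hh = human[2] - human[0], human[3] - human[1]
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    ow, oh = obj[2] - obj[0], obj[3] - obj[1]
    dx = (ox - hx) / hw
    dy = (oy - hy) / hh
    log_scale = 0.5 * math.log((ow * oh) / (hw * hh))
    return (dx, dy, log_scale)


def fit_diagonal_gaussian(samples, min_var=1e-3):
    """Fit per-dimension mean and variance to a list of feature tuples.

    Variances are floored at min_var so that near-constant features do not
    produce a degenerate density.
    """
    columns = list(zip(*samples))
    means = [mean(c) for c in columns]
    variances = [max(pvariance(c), min_var) for c in columns]
    return means, variances


def log_likelihood(x, model):
    """Log-density of feature tuple x under the diagonal Gaussian model."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )
```

Under these assumptions, a model fitted on human-object pairs from images of one action would assign a higher `log_likelihood` to a candidate object whose position and size relative to the human resemble the training pairs than to one placed elsewhere. Normalizing by the human box makes the features independent of image position and person size, which fits the human-centric framing of the abstract.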