Discovering important people and objects for egocentric video summarization

Authors:
Joydeep Ghosh
Affiliations:
University of Texas at Austin
Venue:
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Year:
2012

Cited by:
Detecting eye contact using wearable eye-tracking glasses
Proceedings of the 2012 ACM Conference on Ubiquitous Computing

Learning to recognize daily actions using gaze
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part I

Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III

Video summarization: techniques and classification
ICCVG'12 Proceedings of the 2012 international conference on Computer Vision and Graphics

Active labeling application applied to food-related object recognition
Proceedings of the 5th international workshop on Multimedia for cooking & eating activities

Abstract

We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer's day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video — such as the nearness to hands, gaze, and frequency of occurrence — and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results with 17 hours of egocentric data show the method's promise relative to existing techniques for saliency and summarization.
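
The pipeline the abstract outlines (region cues, a learned importance regressor, temporal event detection, one storyboard frame per event) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The three-number cue vector, the plain linear regressor fitted by gradient descent, the scalar per-frame "signature" with a fixed threshold for event boundaries, and all function names are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical cue vector per region: (hand nearness, gaze nearness,
# frequency of occurrence), each scaled to [0, 1]. An assumption for
# illustration, not the paper's feature set.
Cues = Tuple[float, float, float]


def predict_importance(w: Sequence[float], cues: Cues) -> float:
    """Linear importance score; the last entry of w is the bias."""
    return sum(wj * cj for wj, cj in zip(w, cues)) + w[-1]


def fit_linear_regressor(X: Sequence[Cues], y: Sequence[float],
                         lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Fit weights (bias last) by batch gradient descent on squared error."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_importance(w, xi) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def segment_events(signatures: Sequence[float],
                   threshold: float) -> List[List[int]]:
    """Assumed stand-in for event detection: start a new event whenever
    consecutive frame signatures differ by more than the threshold."""
    if not signatures:
        return []
    events = [[0]]
    for t in range(1, len(signatures)):
        if abs(signatures[t] - signatures[t - 1]) > threshold:
            events.append([])
        events[-1].append(t)
    return events


def build_storyboard(w: Sequence[float],
                     frames_regions: Sequence[Sequence[Cues]],
                     events: Sequence[Sequence[int]]) -> List[int]:
    """Pick, per event, the frame whose best region scores highest."""
    def frame_score(t: int) -> float:
        return max((predict_importance(w, r) for r in frames_regions[t]),
                   default=float("-inf"))
    return [max(event, key=frame_score) for event in events]


if __name__ == "__main__":
    # Toy training data: cue vectors with hand-assigned importance labels.
    X = [(0.9, 0.8, 0.7), (0.1, 0.2, 0.1), (0.8, 0.9, 0.6), (0.2, 0.1, 0.3)]
    y = [1.0, 0.0, 1.0, 0.0]
    w = fit_linear_regressor(X, y)

    low, high = (0.1, 0.1, 0.1), (0.9, 0.9, 0.8)
    frames = [[low], [high, low], [low], [low], [(0.7, 0.8, 0.5)]]
    events = segment_events([0.1, 0.12, 0.11, 0.9, 0.88], threshold=0.5)
    print(build_storyboard(w, frames, events))
```

Because the regressor sees only the cue values and never an object label or a user identity, the same weights can score regions from unseen objects or wearers, which is the property the abstract emphasizes.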