Selecting the best faces to index presentation videos

  • Authors:
  • Michele Merler; John R. Kender

  • Affiliations:
  • Columbia University, New York, NY, USA (both authors)

  • Venue:
  • MM '11: Proceedings of the 19th ACM International Conference on Multimedia
  • Year:
  • 2011

Abstract

We propose a system to select the most representative faces in unstructured presentation videos with respect to two criteria: optimizing matching accuracy between pairs of face tracks, and selecting humanly preferred face icons for indexing purposes. We first extract face tracks using state-of-the-art face detection and tracking. A small subset of images is then selected per track in order to maximize matching accuracy between tracks. Finally, representative images are extracted for each speaker in order to build a face index of the video. We tested our approach on three unstructured presentation videos of approximately 45 minutes each, for a total of a quarter million frames. Compared to the standard min-min approach, our method achieves higher track matching accuracy (94.22%) while requiring only 6% of the running time. Using an optimal combination of three user-preference measures, we were able to build face indexes containing 54 speakers (out of the 58 present in the videos), indexing into 795 detected tracks.
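To make the comparison in the abstract concrete, below is a minimal, illustrative sketch of the baseline "min-min" track matching versus matching on a small selected subset of faces per track. The function names, the use of farthest-point sampling for subset selection, and the Euclidean distance on face descriptors are assumptions for illustration only; the paper's actual selection criterion (chosen to optimize matching accuracy) and descriptors may differ.

```python
import numpy as np

def min_min_distance(track_a, track_b):
    """Baseline 'min-min' matching: the distance between two face tracks is
    the smallest pairwise distance over all descriptor pairs (O(|A|*|B|))."""
    dists = np.linalg.norm(track_a[:, None, :] - track_b[None, :, :], axis=2)
    return dists.min()

def select_representatives(track, k=5):
    """Illustrative subset selection via farthest-point sampling: greedily
    pick k mutually diverse face descriptors from the track. This is a
    stand-in for the paper's accuracy-driven selection, not its method."""
    chosen = [0]
    while len(chosen) < min(k, len(track)):
        # Distance from every face to its nearest already-chosen face.
        d = np.linalg.norm(track[:, None, :] - track[chosen][None, :, :], axis=2).min(axis=1)
        chosen.append(int(np.argmax(d)))
    return track[chosen]

def subset_distance(track_a, track_b, k=5):
    """Match tracks using only the selected subsets; with k faces per track
    this costs O(k^2) comparisons instead of O(|A|*|B|)."""
    return min_min_distance(select_representatives(track_a, k),
                            select_representatives(track_b, k))

# Usage example with random stand-in descriptors (e.g., 128-D face features).
rng = np.random.default_rng(0)
track_a = rng.normal(size=(300, 128))
track_b = rng.normal(size=(250, 128))
print(min_min_distance(track_a, track_b), subset_distance(track_a, track_b, k=5))
```

The point of the sketch is the cost/accuracy trade-off the abstract reports: comparing small, well-chosen subsets can approach (or, per the paper, exceed) the accuracy of exhaustive min-min matching at a small fraction of its running time.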