Gesture salience as a hidden variable for coreference resolution and keyframe extraction
Journal of Artificial Intelligence Research
Creating video recordings of events such as lectures or meetings is increasingly easy and inexpensive. However, reviewing the content of such video can be time-consuming and difficult. Our goal is to produce a "comic book" summary, in which a transcript is augmented with keyframes that disambiguate and clarify the accompanying text. Unlike most previous keyframe extraction systems, which rely primarily on visual cues, we present a linguistically motivated approach that selects keyframes containing salient gestures. Rather than learning gesture salience directly, we estimate it by measuring the contribution of gesture to the understanding of other discourse phenomena. More specifically, we bootstrap from multimodal coreference resolution to identify gestures that improve coreference performance, and we then select keyframes that capture these gestures. Our model treats gesture salience as a hidden variable in a conditional framework, with observable features from both the visual and textual modalities. This approach significantly outperforms competitive baselines that do not use gesture information.
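To make the hidden-variable formulation concrete, the following is a minimal sketch, not the authors' implementation, of how a conditional coreference score can marginalize over a binary gesture-salience variable. All feature names, weight vectors, and the logistic parameterization are illustrative assumptions; the paper's actual model and features may differ.

```python
# Hypothetical sketch: coreference probability with a hidden binary
# gesture-salience variable s, marginalized out of a conditional model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coref_probability(x_text, x_visual, w_text, w_gesture, w_salience):
    """P(coref | x) = sum over s in {0,1} of P(coref | x, s) * P(s | x_visual).

    s = 1 means the accompanying gesture is salient, so gestural features
    contribute to the coreference decision; s = 0 means they are ignored.
    """
    # Distribution over the hidden salience variable, from visual features only.
    p_salient = sigmoid(x_visual @ w_salience)
    # Coreference score when the gesture is ignored (s = 0) ...
    p_coref_not_salient = sigmoid(x_text @ w_text)
    # ... and when gestural features are included (s = 1).
    p_coref_salient = sigmoid(x_text @ w_text + x_visual @ w_gesture)
    # Marginalize over the hidden variable.
    return p_salient * p_coref_salient + (1.0 - p_salient) * p_coref_not_salient

# Toy usage with made-up features and weights.
x_text = np.array([1.0, 0.3])    # e.g., string match, sentence distance
x_visual = np.array([0.8, 0.5])  # e.g., hand speed, distance from rest pose
w_text = np.array([2.0, -1.0])
w_gesture = np.array([1.5, 0.5])
w_salience = np.array([1.0, 1.0])
print(coref_probability(x_text, x_visual, w_text, w_gesture, w_salience))
```

Under this kind of formulation, the posterior over the salience variable (here, `p_salient` reweighted by how much the gesture helped the coreference decision) is what would then drive keyframe selection toward frames that capture salient gestures.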