Fast unsupervised alignment of video and text for indexing/names and faces

Authors:
Subhransu Maji; Ruzena Bajcsy
Affiliations:
University of California, Berkeley, Berkeley, CA; University of California, Berkeley, Berkeley, CA
Venue:
Workshop on multimedia information retrieval on The many faces of multimedia semantics
Year:
2007

Abstract

We propose a novel way of combining weakly associated video/audio and text streams in an unsupervised manner that is faster than conventional speech recognition. The technique aligns the audio/video and text streams, which enables video search using the associated text. Multimedia of this form includes news broadcasts with summaries, parliamentary proceedings and court trials with transcripts, sports telecasts with text commentary, etc. We also show how to annotate the video with the names of the people appearing in it, which allows name-based indexing and search. We test the technique on an 80-minute video segment downloaded from the website of the International Court of the Former Yugoslavia, together with the corresponding transcripts. The proposed technique achieves 88.49% accuracy on sentence-level alignment and 95.5% accuracy on the task of assigning names to faces.