Annotating news video with locations

Authors:
Jun Yang;Alexander G. Hauptmann
Affiliations:
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Venue:
CIVR'06 Proceedings of the 5th international conference on Image and Video Retrieval
Year:
2006

Abstract

The location of a video scene is an important semantic descriptor, especially for broadcast news video. In this paper, we propose a learning-based approach to annotating shots of news video with locations extracted from the video transcript, based on features from multiple video modalities, including the syntactic structure of transcript sentences, speaker identity, and temporal video structure. Machine learning algorithms are adopted to combine these multi-modal features to solve two sub-problems: (1) whether the location of a video shot is mentioned in the transcript, and if so, (2) which of the many locations in the transcript are the correct one(s) for this shot. Experiments on the TRECVID dataset demonstrate that our approach achieves approximately 85% accuracy in correctly labeling the location of any shot in news video.
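
The two sub-problems described in the abstract can be read as a two-stage decision. The sketch below is only an illustration of that structure and is not the authors' implementation: the logistic scorer, the feature names (`anchor_shot`, `mentioned_near_shot`), the weights, and the 0.5 threshold are all hypothetical placeholders. In a learning-based system the weights would be fit on labeled shots rather than written by hand.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical representation: sparse feature dictionaries.
Features = Dict[str, float]
Model = Tuple[Features, float]  # (weights, bias); values here are made up


def logistic_score(features: Features, weights: Features, bias: float) -> float:
    """Return sigmoid(w . x + b) for a sparse feature dictionary."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Candidate:
    location: str       # location name found in the transcript
    features: Features  # features of the (shot, candidate location) pair


def annotate_shot(shot_features: Features,
                  candidates: List[Candidate],
                  stage1: Model,
                  stage2: Model,
                  threshold: float = 0.5) -> List[str]:
    """Return the transcript locations assigned to one shot (possibly none)."""
    # Stage 1: is the shot's location mentioned in the transcript at all?
    w1, b1 = stage1
    if not candidates or logistic_score(shot_features, w1, b1) < threshold:
        return []
    # Stage 2: among the candidates, keep those scored as correct for this shot.
    w2, b2 = stage2
    return [c.location for c in candidates
            if logistic_score(c.features, w2, b2) >= threshold]


# Toy, hand-set models (assumptions for illustration only).
stage1_model: Model = ({"anchor_shot": -2.0}, 1.0)
stage2_model: Model = ({"mentioned_near_shot": 2.0}, -1.0)

candidates = [
    Candidate("Baghdad", {"mentioned_near_shot": 1.0}),
    Candidate("Washington", {"mentioned_near_shot": 0.0}),
]

field_shot = annotate_shot({"anchor_shot": 0.0}, candidates,
                           stage1_model, stage2_model)
studio_shot = annotate_shot({"anchor_shot": 1.0}, candidates,
                            stage1_model, stage2_model)
print(field_shot, studio_shot)
```

Separating the two stages means a shot whose location is never mentioned in the transcript is rejected once, before any candidate is scored, and the second stage may return more than one location, matching the "correct one(s)" wording above.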