Audio-visual synchrony for detection of monologues in video archives

Authors:
G. Iyengar;H. J. Nock;C. Neti
Affiliations:
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Venue:
ICME '03 Proceedings of the 2003 International Conference on Multimedia and Expo - Volume 2
Year:
2003

Abstract

In this paper we present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 video retrieval track (VT02), the underlying approach of measuring synchrony between audio and video signals is also applicable to voice- and face-based biometrics, to assessing lip-synchronization quality in movie editing, and to speaker localization in video. Our approach is envisioned as a two-part scheme. We first detect the occurrence of speech and of a face in a video shot. Among shots containing both speech and a face, we identify as monologue shots those in which the speech and facial movements are synchronized. To measure the synchrony between speech and facial movements we use a mutual-information-based measure. Experiments with the VT02 corpus indicate that using synchrony yields a relative improvement of more than 50% in average precision compared to using face and speech information alone. Our synchrony-based monologue detector submission had the best average precision (in VT02) amongst 18 different submissions.
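The two-part scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one scalar feature per frame for each modality (for example audio energy and mouth-region motion, both hypothetical choices) and assumes the two streams are jointly Gaussian, in which case the mutual information reduces to I = -0.5 ln(1 - rho^2), where rho is the correlation coefficient. The function names, the `has_speech` and `has_face` flags, and the threshold value are illustrative assumptions; the abstract does not specify them.

```python
import math
from statistics import fmean


def gaussian_mutual_information(audio, video):
    """Mutual information (in nats) between two equal-length scalar
    sequences, ASSUMING they are jointly Gaussian:
    I = -0.5 * ln(1 - rho^2), with rho the correlation coefficient."""
    if len(audio) != len(video) or len(audio) < 2:
        raise ValueError("need two equal-length sequences of at least 2 frames")
    mean_a, mean_v = fmean(audio), fmean(video)
    # Unnormalised (co)variances; the 1/N factors cancel in rho^2.
    c_aa = sum((a - mean_a) ** 2 for a in audio)
    c_vv = sum((v - mean_v) ** 2 for v in video)
    c_av = sum((a - mean_a) * (v - mean_v) for a, v in zip(audio, video))
    if c_aa == 0.0 or c_vv == 0.0:
        return 0.0  # a constant stream tells us nothing about the other
    # Clamp so perfectly correlated inputs do not produce log(0).
    rho_sq = min((c_av * c_av) / (c_aa * c_vv), 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def is_monologue(has_speech, has_face, audio, video, threshold=0.1):
    """Two-part decision: first require both speech and a face, then
    require the synchrony score to exceed a (hypothetical) threshold."""
    if not (has_speech and has_face):
        return False
    return gaussian_mutual_information(audio, video) > threshold


# Toy per-frame features (made up for illustration only).
audio_energy = [0, 1, 2, 3, 4, 3, 2, 1]
mouth_motion_synced = [0.1, 1.0, 2.1, 2.9, 4.2, 3.0, 1.9, 1.1]
mouth_motion_unsynced = [1, -1, 1, -1, 1, -1, 1, -1]

print(is_monologue(True, True, audio_energy, mouth_motion_synced))
print(is_monologue(True, True, audio_energy, mouth_motion_unsynced))
```

In this toy setting the first stage acts as a cheap gate, so the synchrony score is computed only for shots that already contain both speech and a face. A practical system would use multidimensional features and a threshold tuned on held-out data.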