Assessing face and speech consistency for monologue detection in video

  • Authors:
  • H. J. Nock; G. Iyengar; C. Neti

  • Affiliations:
  • IBM TJ Watson Research Center, Yorktown Heights, NY (all authors)

  • Venue:
  • Proceedings of the tenth ACM international conference on Multimedia

  • Year:
  • 2002

Abstract

This paper considers schemes for determining which of a set of faces on screen, if any, is producing the speech in a video soundtrack. Whilst motivated by the TREC 2002 (Video Retrieval Track) monologue detection task, the schemes are also applicable to voice- and face-based biometric systems, to assessing lip-synchronization quality in movie editing and computer animation, and to speaker localization in video. Several approaches are discussed: two implementations of a generic mutual-information-based measure of the degree of synchrony between signals, which can be used with or without prior speech and face detection, and a stronger model-based scheme that follows speech and face detection with an assessment of face and lip-movement plausibility. The schemes are compared on a corpus of 1016 test cases containing multiple faces and multiple speakers, a test set 200 times larger than the nearest comparable test set of which we are aware. The most successful scheme, which is also the computationally cheapest, achieves 82% accuracy on the task of picking the "consistent" speaker from a set that includes three confusers. A final experiment demonstrates the potential utility of the scheme for speaker localization in video.
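The abstract does not specify the features or estimator used, so the following is a minimal sketch of the general mutual-information-based synchrony idea only, assuming per-frame audio energy and a per-face mouth-region motion signal as inputs and a simple histogram MI estimate; the function names, feature choices, and bin count are hypothetical and not the authors' actual pipeline.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of mutual information I(X;Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y (row vector)
    nz = pxy > 0                           # mask to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def synchrony_score(audio_energy, mouth_motion, bins=16):
    """Degree of synchrony between the soundtrack and one candidate face.

    audio_energy : per-frame short-time energy of the audio track
    mouth_motion : per-frame motion measure for that face's mouth region
                   (assumed already aligned to the same frame rate)
    """
    n = min(len(audio_energy), len(mouth_motion))
    return mutual_information(audio_energy[:n], mouth_motion[:n], bins)

def pick_speaker(audio_energy, mouth_motions):
    """Pick the face most consistent with the audio, as in the paper's
    'consistent speaker among confusers' task: highest MI wins."""
    scores = [synchrony_score(audio_energy, m) for m in mouth_motions]
    return int(np.argmax(scores)), scores
```

Under this reading, the measure needs no explicit speech or lip model: any per-face visual signal that co-varies with speech production will share information with the audio, which is what makes the generic MI formulation usable with or without prior speech and face detection.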