This paper considers schemes for determining which of a set of faces on screen, if any, is producing speech in a video soundtrack. Whilst motivated by the TREC 2002 (Video Retrieval Track) monologue detection task, the schemes are also applicable to voice and face-based biometrics systems, for assessing lip synchronization quality in movie editing and computer animation, and for speaker localization in video. Several approaches are discussed: two implementations of a generic mutual-information-based measure of the degree of synchrony between signals, which can be used with or without prior speech and face detection, and a stronger model-based scheme which follows speech and face detection with an assessment of face and lip movement plausibility. Schemes are compared on a corpus of 1016 test cases containing multiple faces and multiple speakers, a test set 200 times larger than the nearest comparable test set of which we are aware. The most successful and computationally cheapest scheme obtains an accuracy of 82% on the task of picking the "consistent" speaker from a set including three confusers. A final experiment demonstrates the potential utility of the scheme for speaker localization in video.
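To make the idea of a mutual-information-based synchrony measure concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes each signal has been reduced to one scalar per video frame (for example, audio energy and mouth-region motion; both are illustrative choices) and that each audio/face pair is jointly Gaussian, in which case the mutual information reduces to -0.5 * ln(1 - rho^2), where rho is the correlation coefficient. The function names and the `min_score` threshold are hypothetical.

```python
import math
import random


def gaussian_mutual_information(x, y):
    """Mutual information (in nats) between two equal-length scalar
    sequences, ASSUMING they are jointly Gaussian (an illustrative
    modelling choice, not necessarily the paper's)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant signal shares no information with anything
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Squared correlation, clipped just below 1 so the log stays finite.
    rho_sq = min(cov * cov / (var_x * var_y), 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def pick_speaker(audio, face_tracks, min_score=0.0):
    """Score every candidate face track against the audio track.

    Returns (index, scores); index is None when there are no candidates
    or when no score exceeds min_score (hypothetical threshold that
    covers the "if any" case)."""
    scores = [gaussian_mutual_information(audio, t) for t in face_tracks]
    if not scores:
        return None, scores
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] <= min_score:
        return None, scores
    return best, scores


if __name__ == "__main__":
    # Synthetic data: one face track that follows the audio plus noise,
    # and three unrelated "confuser" tracks.
    random.seed(0)
    frames = 200
    audio = [random.gauss(0.0, 1.0) for _ in range(frames)]
    confusers = [[random.gauss(0.0, 1.0) for _ in range(frames)] for _ in range(3)]
    speaker = [a + 0.5 * random.gauss(0.0, 1.0) for a in audio]
    best, scores = pick_speaker(audio, confusers + [speaker])
    print("chosen face index:", best)
    print("scores (nats):", [round(s, 3) for s in scores])
```

In this sketch the face whose feature track shares the most information with the audio is taken to be the speaker. Real features would be multi-dimensional, in which case the correlation term would be replaced by covariance determinants, or by a discrete estimate over quantized features.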