Assessing face and speech consistency for monologue detection in video

  • Authors:
  • H. J. Nock; G. Iyengar; C. Neti

  • Affiliations:
  • IBM TJ Watson Research Center, Yorktown Heights, NY (all authors)

  • Venue:
  • Proceedings of the tenth ACM international conference on Multimedia

  • Year:
  • 2002

Abstract

This paper considers schemes for determining which of a set of faces on screen, if any, is producing the speech in a video soundtrack. Whilst motivated by the TREC 2002 (Video Retrieval Track) monologue detection task, the schemes are also applicable to voice- and face-based biometric systems, to assessing lip-synchronization quality in movie editing and computer animation, and to speaker localization in video. Several approaches are discussed: two implementations of a generic mutual-information-based measure of the degree of synchrony between signals, which can be used with or without prior speech and face detection, and a stronger model-based scheme that follows speech and face detection with an assessment of face and lip-movement plausibility. The schemes are compared on a corpus of 1016 test cases containing multiple faces and multiple speakers, a test set 200 times larger than the nearest comparable test set of which we are aware. The most successful scheme, which is also the computationally cheapest, achieves 82% accuracy on the task of picking the "consistent" speaker from a set that includes three confusers. A final experiment demonstrates the potential utility of the scheme for speaker localization in video.
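The abstract does not specify the features or estimator used, so the following is a minimal sketch of the general mutual-information-based synchrony idea only, assuming per-frame audio energy and a per-face mouth-region motion signal as inputs and a simple histogram MI estimate; the function names, feature choices, and bin count are hypothetical and not the authors' actual pipeline.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of mutual information I(X;Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y (row vector)
    nz = pxy > 0                           # mask to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def synchrony_score(audio_energy, mouth_motion, bins=16):
    """Degree of synchrony between the soundtrack and one candidate face.

    audio_energy : per-frame short-time energy of the audio track
    mouth_motion : per-frame motion measure for that face's mouth region
                   (assumed already aligned to the same frame rate)
    """
    n = min(len(audio_energy), len(mouth_motion))
    return mutual_information(audio_energy[:n], mouth_motion[:n], bins)

def pick_speaker(audio_energy, mouth_motions):
    """Pick the face most consistent with the audio, as in the paper's
    'consistent speaker among confusers' task: highest MI wins."""
    scores = [synchrony_score(audio_energy, m) for m in mouth_motions]
    return int(np.argmax(scores)), scores
```

Under this reading, the measure needs no explicit speech or lip model: any per-face visual signal that co-varies with speech production will share information with the audio, which is what makes the generic MI formulation usable with or without prior speech and face detection.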