This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insight into the practical utility of existing synchrony definitions and justify applying audio-visual synchrony techniques to active speaker localisation in broadcast video. Performance is evaluated on a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centred on the active speaker's mouth with no prior face detection; the upper bound on performance, if perfect face detection were available, is 69%. This result is significantly better than that of the two purely video image-based schemes.
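One common family of synchrony definitions scores how strongly an audio energy envelope co-varies with per-frame motion in a candidate image region, and localises the active speaker as the region with the highest score. The sketch below illustrates this idea with a simple Pearson-correlation synchrony measure; the function names, the toy signals, and the choice of correlation (rather than, say, a mutual-information measure) are illustrative assumptions, not the paper's specific method.

```python
import numpy as np

def av_synchrony(audio_energy, visual_motion):
    # Pearson correlation between an audio energy envelope and a
    # per-frame visual motion signal: one simple synchrony definition
    # (illustrative; not necessarily the measure used in the paper).
    a = audio_energy - audio_energy.mean()
    v = visual_motion - visual_motion.mean()
    denom = np.sqrt((a * a).sum() * (v * v).sum())
    return float((a * v).sum() / denom) if denom > 0 else 0.0

def locate_active_speaker(audio_energy, region_motions):
    # Score each candidate region against the audio and pick the best.
    scores = [av_synchrony(audio_energy, m) for m in region_motions]
    return int(np.argmax(scores)), scores

# Toy example: two candidate mouth regions, one moving in sync with the audio.
rng = np.random.default_rng(0)
audio = np.abs(np.sin(np.linspace(0, 6 * np.pi, 120)))    # energy envelope
speaker = audio + 0.1 * rng.standard_normal(120)          # synchronous motion
listener = 0.5 * rng.random(120)                          # unrelated motion
idx, scores = locate_active_speaker(audio, [listener, speaker])
```

Here `idx` selects the region whose motion best tracks the audio; in a real system the candidate regions would come from face/mouth detection or a dense grid over the frame, and the synchrony measure would be evaluated over a sliding temporal window.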