Visual speaker localization aided by acoustic models

Authors:
Gerald Friedland;Chuohao Yeo;Hayley Hung
Affiliations:
International Computer Science Institute, Berkeley, CA, USA;University of California, Berkeley, CA, USA;IDIAP Research Institute, Martigny, Switzerland
Venue:
MM '09 Proceedings of the 17th ACM international conference on Multimedia
Year:
2009

Abstract

This paper presents a novel audio-visual approach for unsupervised speaker localization. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when". In a second step, the visual models are used to infer the location of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of state-of-the-art (audio-only) speaker diarization, but also to add visual speaker localization at little incremental engineering and computational cost.
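The second step described above, going from "who spoke when" to "where in the frame", can be illustrated with a deliberately simplified sketch. It is not the paper's method: the abstract does not specify the visual models, so the code assumes a hypothetical per-frame activity score for each of a fixed set of image regions and assigns each diarized speaker to the region that is most active while that speaker talks. The function name `localize_speakers`, the input layout, and the use of -1 for non-speech frames are all assumptions made for illustration.

```python
def localize_speakers(speaker_per_frame, region_activity):
    """Assign each diarized speaker to the most active image region.

    speaker_per_frame: list of int speaker labels, one per video frame,
        with -1 marking non-speech frames (hypothetical convention).
    region_activity: list of per-frame lists, one activity score per
        image region (hypothetical visual feature).
    Returns a dict mapping speaker label -> region index.
    """
    totals = {}  # speaker label -> summed activity per region
    for speaker, activity in zip(speaker_per_frame, region_activity):
        if speaker < 0:
            continue  # skip frames in which nobody is speaking
        if speaker not in totals:
            totals[speaker] = [0.0] * len(activity)
        for region, score in enumerate(activity):
            totals[speaker][region] += score

    # For each speaker, pick the region with the highest total activity
    # accumulated over that speaker's speech frames.
    return {
        speaker: max(range(len(sums)), key=sums.__getitem__)
        for speaker, sums in totals.items()
    }


if __name__ == "__main__":
    labels = [0, 0, 1, 1, -1, 0]
    activity = [
        [0.9, 0.1],
        [0.8, 0.2],
        [0.1, 0.7],
        [0.2, 0.9],
        [0.0, 0.0],
        [0.7, 0.3],
    ]
    print(localize_speakers(labels, activity))
```

In this toy input, speaker 0 talks while region 0 is most active and speaker 1 talks while region 1 is most active, so each speaker is mapped to a different region. The system described in the abstract goes further by estimating the acoustic and visual models jointly, which this sketch does not attempt.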