Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem

  • Authors:
  • Gerald Friedland; Chuohao Yeo; Hayley Hung

  • Affiliations:
  • International Computer Science Institute, Berkeley, CA; Institute for Infocomm Research, Singapore; IDIAP Research Institute, Martigny, Switzerland

  • Venue:
  • ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
  • Year:
  • 2010


Abstract

This article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the speech regions and estimates “who spoke when”; in a second step, the visual models are used to infer the location of the speakers in the video. We call this process “dialocalization.” The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost. The combined algorithm exhibits properties, such as increased robustness, that are not observed in algorithms based on a single modality. The article describes the algorithm, presents benchmarking results, explains its properties, and systematically discusses the contributions of each modality.
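To make the two-stage structure described in the abstract concrete, here is a minimal, heavily simplified sketch of the dialocalization idea: audio and visual activity features are concatenated and clustered in one joint unsupervised model ("who spoke when"), and each speaker's spatial location is then read off from the visual dimensions of that speaker's cluster. This is not the authors' implementation; their system builds on the ICSI agglomerative GMM/BIC diarization pipeline and compressed-domain video features, whereas this sketch uses synthetic stand-in features, a fixed two-speaker GMM, and hypothetical per-region motion features for illustration only.

```python
# Sketch of "dialocalization" as a joint audio-visual clustering problem.
# Stage 1: unsupervised diarization over joint audio+visual features.
# Stage 2: speaker localization from the visual part of each cluster model.
# All data and feature extractors below are synthetic stand-ins.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for real features:
#   audio_feats  -- e.g., MFCCs from a single far-field microphone
#   visual_feats -- e.g., motion activity per spatial region of a single
#                   low-resolution overview camera (one dim per region)
n_frames, n_regions = 600, 3
true_speaker = rng.integers(0, 2, size=n_frames)            # 2 speakers
audio_feats = rng.normal(true_speaker[:, None] * 2.0, 1.0, (n_frames, 4))
visual_feats = rng.normal(0.0, 0.3, (n_frames, n_regions))
# Each speaker's motion concentrates in one image region while talking.
visual_feats[np.arange(n_frames), true_speaker] += 1.5

# Stage 1: acoustic and visual models estimated jointly -- features from
# both modalities are concatenated and clustered in one optimization.
# (The paper's system finds the number of speakers via agglomerative
# clustering with BIC; two components are hard-coded here for brevity.)
joint = np.hstack([audio_feats, visual_feats])
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(joint)
who_spoke_when = gmm.predict(joint)                          # diarization

# Stage 2: the visual part of each cluster mean indicates which image
# region is most active when that speaker talks, i.e., where they sit.
visual_means = gmm.means_[:, audio_feats.shape[1]:]
for k in range(2):
    region = int(np.argmax(visual_means[k]))
    print(f"speaker cluster {k}: most active image region = {region}")
```

Because both modalities live in one model, frames where the audio evidence is weak can still be assigned correctly from the visual activity, which is one intuition behind the increased robustness the abstract reports for the combined algorithm.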