This paper presents a novel audio-visual approach to unsupervised speaker localization. Using recordings from a single low-resolution room-overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The system first determines the number of speakers automatically and estimates "who spoke when"; in a second step, the visual models are used to infer each speaker's location in the video. Experiments were performed on 4.5 hours of real-world meetings from the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost.
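The two-step pipeline described above can be sketched with a deliberately simplified stand-in: k-means clustering replaces the actual system's agglomerative GMM-based diarization, and per-region motion energy replaces its learned visual models. All function names, parameters, and the synthetic data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def diarize(audio_feats, n_speakers, n_iter=10, seed=0):
    """Toy diarization: k-means over per-frame audio features,
    returning a speaker label for every frame ("who spoke when")."""
    rng = np.random.default_rng(seed)
    centers = audio_feats[rng.choice(len(audio_feats), n_speakers, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(audio_feats[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = audio_feats[labels == k].mean(axis=0)
    return labels

def localize(labels, motion, n_speakers):
    """Second step: for each speaker, pick the video region with the
    highest mean motion energy over the frames that speaker was talking."""
    return {k: int(motion[labels == k].mean(axis=0).argmax())
            for k in range(n_speakers) if np.any(labels == k)}

# Synthetic demo: 100 frames, two well-separated speakers, three
# candidate video regions (entirely made-up data for illustration).
rng = np.random.default_rng(1)
audio = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                   rng.normal(10.0, 0.5, (50, 2))])
motion = rng.uniform(0.0, 0.1, (100, 3))
motion[:50, 0] += 1.0   # region 0 moves while the first speaker talks
motion[50:, 2] += 1.0   # region 2 moves while the second speaker talks

labels = diarize(audio, n_speakers=2)
regions = localize(labels, motion, n_speakers=2)
```

The key property this sketch shares with the paper's approach is that the visual localization step costs almost nothing extra: it reuses the diarization output as a time index into the video stream rather than running a separate visual speaker detector.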