Multi-modal speaker diarization of real-world meetings using compressed-domain video features

  • Authors:
  • Gerald Friedland; Hayley Hung; Chuohao Yeo

  • Affiliations:
  • International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, USA; IDIAP Research Institute, Rue Marconi 19, CH-1920 Martigny, Switzerland; UC Berkeley, Dept. of EECS, CA 94720, USA

  • Venue:
  • ICASSP '09: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • Year:
  • 2009


Abstract

Speaker diarization was originally defined as the task of determining “who spoke when” given an audio track and no other prior knowledge of any kind. This article presents a multi-modal approach in which we improve a state-of-the-art speaker diarization system by combining standard acoustic features (MFCCs) with compressed-domain video features. The approach is evaluated on over 4.5 hours of the publicly available AMI meeting corpus, which contains challenges such as people standing up and walking out of the room. We show a consistent improvement of about 34% relative in speaker error rate (21% DER) compared to a state-of-the-art audio-only baseline.
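The fusion described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it assumes librosa and scikit-learn, a hypothetical input file meeting.wav, and hypothetical precomputed per-frame video activity features (video_activity.npy) standing in for the paper's compressed-domain features. It shows the generic multi-stream pattern of combining per-stream GMM log-likelihoods with a tunable audio weight.

```python
# A minimal sketch (not the authors' system) of audio-visual multi-stream
# scoring for diarization. File names, feature choices, and the two-cluster
# initialization are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

AUDIO_PATH = "meeting.wav"           # hypothetical audio track
VIDEO_FEATS = "video_activity.npy"   # hypothetical per-frame video features

# 1. Standard acoustic features: 19 MFCCs at a 10 ms hop, a common
#    configuration in diarization systems.
y, sr = librosa.load(AUDIO_PATH, sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                            hop_length=160).T        # (frames, 19)

# 2. Hypothetical compressed-domain video features (e.g. per-person
#    motion activity), assumed already resampled to the audio frame rate.
video = np.load(VIDEO_FEATS)                         # (frames, n_persons)
n = min(len(mfcc), len(video))
mfcc, video = mfcc[:n], video[:n]

# 3. Fit one GMM per stream per speaker cluster. Toy two-cluster case:
#    initialize by splitting the recording in half.
alpha = 0.9            # audio stream weight; a tunable assumption
half = n // 2
clusters = [(mfcc[:half], video[:half]), (mfcc[half:], video[half:])]

models = []
for a_feats, v_feats in clusters:
    gm_a = GaussianMixture(n_components=5).fit(a_feats)
    gm_v = GaussianMixture(n_components=3).fit(v_feats)
    models.append((gm_a, gm_v))

# 4. Frame-level assignment by weighted joint log-likelihood across streams.
scores = np.stack([
    alpha * gm_a.score_samples(mfcc) + (1 - alpha) * gm_v.score_samples(video)
    for gm_a, gm_v in models
])
labels = scores.argmax(axis=0)   # per-frame cluster decision
```

A full diarization system would iterate this scoring inside an agglomerative clustering and realignment loop rather than making a single frame-level pass; the sketch only shows how a second feature stream can enter the per-cluster likelihood computation.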