HCII'11 Proceedings of the 1st international conference on Human interface and the management of information: interacting with information - Volume Part II
This paper proposes a speaker diarization method for determining "who spoke when" in multi-party conversations, based on the probabilistic fusion of audio and visual location information. The audio and visual information is obtained from a compact system designed to analyze round-table multi-party conversations. The system consists of two cameras and a triangular array of three microphones, and can cover a spherical region around the device. Speaker locations are estimated from the audio and visual observations as azimuths relative to this recording system. Unlike conventional speaker diarization methods, which use a cascade of speech activity detection, direction-of-arrival estimation, acoustic feature extraction, and information-criterion-based speaker segmentation, the proposed method estimates the probability that multiple simultaneous speakers are present at locations in physical space using only this small microphone setup. To estimate speaker presence more reliably, the speech presence probabilities in physical space are integrated with probabilities derived from the participants' face locations, which are obtained with a robust particle-filter-based face tracker operating on the two fisheye-lens cameras. Locations with high integrated probabilities are then classified on-line into a number of speaker classes, yielding the diarization result. Because the probability calculations and speaker classifications are performed on-line, the method does not need to observe the entire conversation before producing output. An experiment using real casual conversations, which contain more overlaps and short speech segments than formal meetings, demonstrated the advantages of the proposed method.
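The core idea — fusing per-azimuth speech presence probabilities with tracked face locations, then assigning high-probability directions to speakers — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the bin resolution, the Gaussian visual model, the product-fusion rule, and the threshold are all hypothetical choices made here for clarity.

```python
import math

AZIMUTH_BINS = 36  # hypothetical resolution: one bin per 10 degrees of azimuth


def visual_bump(b, mu, sigma=1.5):
    """Unnormalized Gaussian bump around a tracked face, wrapped on the circle."""
    d = min(abs(b - mu), AZIMUTH_BINS - abs(b - mu))  # circular distance in bins
    return math.exp(-0.5 * (d / sigma) ** 2)


def fuse(audio_prob, face_bins):
    """Integrate per-bin speech presence probabilities with face locations.

    audio_prob: speech presence probability per azimuth bin (from the mic array)
    face_bins:  azimuth bins of faces reported by the video tracker
    Product fusion is just one simple choice of integration rule.
    """
    fused = []
    for b in range(AZIMUTH_BINS):
        visual = max((visual_bump(b, c) for c in face_bins), default=0.0)
        fused.append(audio_prob[b] * visual)
    return fused


def active_bins(fused, threshold=0.5):
    """On-line step: bins whose fused probability exceeds a threshold are
    treated as active speaker directions for the current frame."""
    return [b for b, p in enumerate(fused) if p > threshold]
```

Run frame by frame, the active bins would then be matched to persistent speaker classes (e.g., by nearest tracked face), which is what allows the method to handle overlapping speech: two bins can exceed the threshold in the same frame.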