A Robust Method for Speech Signal Time-Delay Estimation in Reverberant Rooms
ICASSP '97 Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97) - Volume 1
Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system
MLMI'05 Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction
Further progress in meeting recognition: the ICSI-SRI spring 2005 speech-to-text evaluation system
MLMI'05 Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction
The 2004 ICSI-SRI-UW meeting recognition system
MLMI'04 Proceedings of the First International Conference on Machine Learning for Multimodal Interaction
Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information
IEEE Transactions on Computers
The LIA RT'07 Speaker Diarization System
Multimodal Technologies for Perception of Humans
A speaker diarization method based on the probabilistic fusion of audio-visual location information
Proceedings of the 2009 International Conference on Multimodal Interfaces
IEEE Transactions on Audio, Speech, and Language Processing
A review on speaker diarization systems and approaches
Speech Communication
Real-time audio-visual analysis for multiperson videoconferencing
Advances in Multimedia
We present a method to extract a speaker-turn segmentation from multiple distant microphones (MDM) using only the delay values found via cross-correlation between the available channels. The method is robust to the number of speakers (which is unknown to the system), the number of channels, and the acoustics of the room. The delays between channels are processed and clustered to obtain a segmentation hypothesis. We obtained a 31.2% diarization error rate (DER) on NIST's RT05s MDM conference-room evaluation set. On an MDM subset of NIST's RT04s development set, we obtained 36.93% DER and 35.73% DER*. Compared with the results of Ellis and Liu [8], who also used between-channel differences on the same data, this represents a 43% relative improvement in error rate.
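The abstract's first step, estimating the delay between a pair of channels via cross-correlation, is commonly done with the GCC-PHAT weighting, which is favored in reverberant rooms because it whitens the spectrum and sharpens the correlation peak. The sketch below is an illustrative NumPy implementation of that pairwise delay estimate only, not the paper's full pipeline (the delay post-processing and clustering stages are not reproduced); the function name and parameters are our own.

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in samples)
    using the GCC-PHAT weighting. Illustrative sketch only."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: discard magnitude, keep only phase, which
    # sharpens the correlation peak under reverberation.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so negative delays sit to the left of index max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(cc)) - max_shift

# Example: a noise signal delayed by 40 samples on the second channel.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
sig = np.concatenate((np.zeros(40), ref))[:4096]
print(gcc_phat(sig, ref))  # → 40
```

In a multi-channel setting, such a delay would typically be computed per analysis window for each microphone pair against a reference channel, yielding the delay vectors that the paper then clusters into a segmentation hypothesis.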