Automatic cluster complexity and quantity selection: towards robust speaker diarization

Authors:
Xavier Anguera;Chuck Wooters;Javier Hernando
Affiliations:
International Computer Science Institute, Berkeley, CA;International Computer Science Institute, Berkeley, CA;Technical University of Catalonia, Barcelona, Spain
Venue:
MLMI'06 Proceedings of the Third international conference on Machine Learning for Multimodal Interaction
Year:
2006

Citing 2
Cited 4

Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system

MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
Further progress in meeting recognition: the ICSI-SRI spring 2005 speech-to-text evaluation system

MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction

Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information

IEEE Transactions on Computers
On-line multi-modal speaker diarization

Proceedings of the 9th international conference on Multimodal interfaces
A Hybrid Generative-Discriminative Approach to Speaker Diarization

MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
Robust speaker diarization for meetings: ICSI RT06S meetings evaluation system

MLMI'06 Proceedings of the Third international conference on Machine Learning for Multimodal Interaction

Quantified Score

Hi-index	0.00

Visualization

Abstract

The goal of speaker diarization is to determine where each participant speaks in a recording. One of the most commonly used technique is agglomerative clustering, where some number of initial models are grouped into the number of present speakers. The choice of complexity, topology, and the number of initial models is vital to the final outcome of the clustering algorithm. In prior systems, these parameters were directly assigned based on development data, and were the same for all recordings. In this paper we present three techniques to select the parameters individually for each case, obtaining a system that is more robust to changes in the data. Although the choice of these values depends on tunable parameters, they are less sensitive to changes in the acoustic data and to how the algorithm distributes data among the different clusters. We show that by using the three techniques, we achieve an improvement up to 8% relative in the development set and 19% relative in the test set over prior systems.