A Hybrid Generative-Discriminative Approach to Speaker Diarization

Authors:
Athanasios K. Noulas; Tim Kasteren; Ben J. Kröse
Affiliations:
University of Amsterdam, Amsterdam, The Netherlands 1098 SJ; University of Amsterdam, Amsterdam, The Netherlands 1098 SJ; University of Amsterdam, Amsterdam, The Netherlands 1098 SJ
Venue:
MLMI '08 Proceedings of the 5th International Workshop on Machine Learning for Multimodal Interaction
Year:
2008

Abstract

In this paper we present a sound probabilistic approach to speaker diarization. We use a hybrid framework in which a discriminative model estimates a distribution over the number of speakers at each point of a multimodal stream. This distribution serves as input to a generative model that can adapt to a novel test set and perform high-accuracy speaker diarization. The approach handles the less common, and therefore harder, segments, such as silence and multiple-speaker parts, efficiently and in a principled probabilistic manner.
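
To make the hybrid idea concrete, the following is a minimal per-frame sketch, not the authors' model. It assumes a discriminative classifier has already produced `count_probs` (the probability that 0, 1, 2, ... speakers are active) and that a generative model has produced `set_loglik` (the log-likelihood of the frame's features under each candidate speaker set). Both names, the uniform split of each count probability among speaker sets of the same size, and the absence of any temporal modelling are illustrative assumptions.

```python
import math
from itertools import combinations


def speaker_sets(speakers, max_count):
    """Enumerate every subset of `speakers` with at most `max_count` members.

    The empty set stands for silence; sets of size >= 2 stand for overlap.
    """
    sets = []
    for k in range(max_count + 1):
        sets.extend(frozenset(c) for c in combinations(speakers, k))
    return sets


def frame_posterior(count_probs, set_loglik):
    """Combine a speaker-count distribution with generative log-likelihoods.

    count_probs[k] : hypothetical discriminative estimate P(k speakers active)
    set_loglik[s]  : hypothetical generative log p(features | speaker set s)
    Returns a dict mapping each speaker set to its posterior probability.
    """
    # Count how many candidate sets share each size, so the count
    # probability can be split uniformly among them (an assumption).
    n_with_count = {}
    for s in set_loglik:
        n_with_count[len(s)] = n_with_count.get(len(s), 0) + 1

    # Bayes rule in log space: log p(x | s) + log P(s).
    log_joint = {}
    for s, loglik in set_loglik.items():
        prior = count_probs[len(s)] / n_with_count[len(s)]
        log_joint[s] = loglik + math.log(prior) if prior > 0 else -math.inf

    top = max(log_joint.values())
    if top == -math.inf:
        raise ValueError("every speaker set has zero prior probability")

    # Normalise with the log-sum-exp trick for numerical stability.
    log_z = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {s: math.exp(v - log_z) for s, v in log_joint.items()}


# Toy frame with two hypothetical speakers, "A" and "B".
candidates = speaker_sets(["A", "B"], max_count=2)
toy_loglik = dict(zip(candidates, [-5.0, -1.0, -3.0, -2.0]))
toy_counts = [0.1, 0.8, 0.1]  # silence, one speaker, two speakers
posterior = frame_posterior(toy_counts, toy_loglik)
best = max(posterior, key=posterior.get)
```

In this toy frame the count distribution favours a single speaker, so the overlap set is penalised even though its likelihood is higher than that of speaker "B" alone. This is one simple way a count estimate could help with rarer segment types such as silence and overlap.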