Extrapolation, Interpolation, and Smoothing of Stationary Time Series.
Robust speaker diarization for meetings: ICSI RT06S meetings evaluation system. MLMI'06: Proceedings of the Third International Conference on Machine Learning for Multimodal Interaction.
The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System. Multimodal Technologies for Perception of Humans.
Narrative theme navigation for sitcoms supported by fan-generated scripts. Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production.
Towards automatic speaker retrieval for large multimedia archives. Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production.
Tuning-robust initialization methods for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing.
Speaker diarization exploiting the eigengap criterion and cluster ensembles. IEEE Transactions on Audio, Speech, and Language Processing.
Narrative theme navigation for sitcoms supported by fan-generated scripts. Multimedia Tools and Applications.
In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus we have avoided the use of any sort of model that requires training on "outside" data, and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beam-forming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.
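To make the agglomerative clustering and BIC components concrete, the sketch below illustrates the classical delta-BIC merge test. It is a minimal illustration under simplifying assumptions, not the ICSI implementation: each cluster is modeled here by a single full-covariance Gaussian over acoustic feature frames (the actual system uses HMM/GMM cluster models and a modified BIC), and the names `delta_bic`, `agglomerate`, and the penalty weight `lam` are hypothetical choices for this example.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Classical delta-BIC merge test for two clusters of feature frames.

    x, y: (n_frames, n_dims) arrays of acoustic features (e.g., MFCCs);
    assumes each cluster has many more frames than feature dimensions.
    A negative return value means the penalized likelihood favors a
    single Gaussian for both clusters, i.e., merging them.
    """
    z = np.vstack([x, y])
    n, d = z.shape

    def half_n_log_det_cov(a):
        # N/2 * log|Sigma| for a full-covariance Gaussian fit to `a`.
        _, logdet = np.linalg.slogdet(np.cov(a, rowvar=False))
        return 0.5 * len(a) * logdet

    # Penalty: extra parameters of keeping two Gaussians instead of one
    # (d mean terms + d(d+1)/2 covariance terms), scaled by log(data size).
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (half_n_log_det_cov(z)
            - half_n_log_det_cov(x)
            - half_n_log_det_cov(y)
            - lam * penalty)

def agglomerate(clusters, lam=1.0):
    """Greedily merge the pair with the lowest delta-BIC until no pair
    scores below zero; BIC thus also decides when to stop merging."""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = {(i, j): delta_bic(clusters[i], clusters[j], lam)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))}
        (i, j), best = min(pairs.items(), key=lambda kv: kv[1])
        if best >= 0:
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

if __name__ == "__main__":
    # Toy example: three clusters of 13-dim MFCC-like frames; the two
    # nearby clusters merge, the distant one stays separate (prints 2).
    rng = np.random.default_rng(0)
    clusters = [rng.normal(m, 1.0, size=(500, 13)) for m in (0.0, 0.1, 3.0)]
    print(len(agglomerate(clusters)))
```

Because merging stops as soon as no cluster pair has a negative delta-BIC, the same criterion that scores merges also serves as the stopping rule that estimates the number of speakers, which is why no prior knowledge of the speaker count is needed.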