The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings

Authors:
Jing Huang;Etienne Marcheret;Karthik Visweswariah;Gerasimos Potamianos
Affiliations:
IBM Thomas J. Watson Research Center, U.S.A. NY 10598;IBM Thomas J. Watson Research Center, U.S.A. NY 10598;IBM Thomas J. Watson Research Center, U.S.A. NY 10598;IBM Thomas J. Watson Research Center, U.S.A. NY 10598
Venue:
Multimodal Technologies for Perception of Humans
Year:
2008

Abstract

We present the IBM systems for the Rich Transcription 2007 (RT07) speaker diarization evaluation task on lecture meeting data. We first give an overview of our baseline system, which was developed last year as part of our speech-to-text system for the RT06s evaluation. We then present a number of simple schemes considered this year in our effort to improve speaker diarization performance, namely: (i) a better speech activity detection (SAD) system, a necessary pre-processing step to speaker diarization; (ii) use of word information from a speaker-independent speech recognizer; (iii) modifications to the speaker cluster merging criteria and the underlying segment model; and (iv) use of speaker models based on Gaussian mixture models, and their iterative refinement by frame-level re-labeling and smoothing of decision likelihoods. We report development experiments on the RT06s evaluation test set that demonstrate that these methods are effective, resulting in dramatic performance improvements over our baseline diarization system. For example, changes in the cluster segment models and cluster merging methodology result in a 24.2% relative reduction in speaker error rate, whereas the iterative model refinement process and word-level alignment produce relative reductions in speaker error rate of 36.0% and 9.2%, respectively. The importance of the SAD subsystem is also shown, with a reduction in SAD error from 12.3% to 4.3% translating to a 20.3% relative reduction in speaker error rate. Unfortunately, however, the developed diarization system depends heavily on appropriately tuned thresholds in the speaker cluster merging process. Possibly as a result of over-tuning such thresholds, performance on the RT07 evaluation test set degrades significantly compared to that observed on development data. Nevertheless, our experiments show that the introduced techniques of cluster merging, speaker model refinement, and alignment remain valuable in the RT07 evaluation.
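
To make item (iv) more concrete, the following is a minimal sketch of frame-level re-labeling with smoothed decision likelihoods. It is an illustration only, not the authors' implementation: the moving-average smoother, the `half_window` parameter, the function names, and the toy scores are all assumptions, and the per-frame log-likelihoods are taken as given rather than computed from Gaussian mixture models.

```python
from typing import List


def smooth_scores(scores: List[List[float]], half_window: int) -> List[List[float]]:
    """Average each speaker's per-frame log-likelihood over a centered window.

    scores[t][k] is the (assumed, precomputed) log-likelihood of frame t
    under the model of speaker k. The window is truncated at the edges.
    """
    n_frames = len(scores)
    smoothed = []
    for t in range(n_frames):
        lo = max(0, t - half_window)
        hi = min(n_frames, t + half_window + 1)
        window = scores[lo:hi]
        n_speakers = len(scores[t])
        smoothed.append(
            [sum(row[k] for row in window) / len(window) for k in range(n_speakers)]
        )
    return smoothed


def relabel_frames(scores: List[List[float]], half_window: int = 1) -> List[int]:
    """Assign each frame to the speaker with the highest smoothed score."""
    smoothed = smooth_scores(scores, half_window)
    return [max(range(len(row)), key=row.__getitem__) for row in smoothed]


# Hypothetical toy log-likelihoods for 5 frames and 2 speaker models.
toy_scores = [
    [-1.0, -3.0],
    [-1.2, -2.8],
    [-2.1, -2.0],  # isolated frame that marginally favors speaker 1
    [-1.1, -2.9],
    [-1.0, -3.0],
]

raw_labels = relabel_frames(toy_scores, half_window=0)       # no smoothing
smoothed_labels = relabel_frames(toy_scores, half_window=1)  # 3-frame window
print(raw_labels)       # [0, 0, 1, 0, 0]
print(smoothed_labels)  # [0, 0, 0, 0, 0]
```

In this toy case, smoothing removes the single-frame switch to speaker 1. In an iterative scheme of the kind the abstract describes, the speaker models would presumably be re-estimated from the re-labeled frames and the procedure repeated; that step is omitted here.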