Prosodic and other Long-Term Features for Speaker Diarization

Authors:
G. Friedland;O. Vinyals;Yan Huang;C. Muller
Affiliations:
Int. Comput. Sci. Inst., Berkeley, CA;-;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Cited by (6):

Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)

Multimodal location estimation
Proceedings of the international conference on Multimedia

Tuning-robust initialization methods for speaker diarization
IEEE Transactions on Audio, Speech, and Language Processing

Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features
Speech Communication

Sherlock Holmes' evil twin: on the impact of global inference for online privacy
Proceedings of the 2011 workshop on New security paradigms workshop

A review on speaker diarization systems and approaches
Speech Communication

Abstract

Speaker diarization is the task of determining "who spoke when" given an audio track and no other prior knowledge of any kind. This article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. Results were measured on standardized datasets (NIST RT) and show a consistent relative improvement of about 30% in diarization error rate compared to the best system presented at the NIST evaluation in 2007.