Fusing short term and long term features for improved speaker diarization

  • Authors:
  • Gerald Friedland; Oriol Vinyals; Yan Huang; Christian Muller

  • Affiliations:
  • International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA; International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA; International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA; German Research Center for AI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany

  • Venue:
  • ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • Year:
  • 2009

Abstract

This article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized data sets (NIST RT) and show a consistent relative improvement of about 30% in diarization error rate compared to the best system presented at the NIST evaluation in 2007. This result was also verified on a broader set of meetings, which we call CombDev, that contains 21 meetings from previous evaluations. Since the prosodic and long-term features were selected using a diarization-independent speaker-discriminability study, we are confident that the same features are able to improve other systems that perform similar tasks.