Multitaper MFCC and PLP features for speaker verification using i-vectors

Authors:
Md Jahangir Alam;Tomi Kinnunen;Patrick Kenny;Pierre Ouellet;Douglas O'Shaughnessy
Affiliations:
INRS-EMT, Montreal, Canada and CRIM, Montreal, Canada;School of Computing, University of Eastern Finland (UEF), Joensuu, Finland;CRIM, Montreal, Canada;CRIM, Montreal, Canada;INRS-EMT, Montreal, Canada
Venue:
Speech Communication
Year:
2013

Abstract

In this paper we study the performance of low-variance multitaper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. Such a single-taper spectrum estimate has large variance, which can be reduced by averaging spectral estimates obtained with a set of different tapers, leading to a so-called multitaper spectral estimate. Multitaper spectrum estimation has proven powerful, especially when the spectrum of interest has a large dynamic range or varies rapidly. Multitaper MFCC features were recently studied in speaker verification with promising preliminary results. Our primary goal in this study is to validate those findings using an up-to-date i-vector classifier on the NIST 2010 SRE data. We also propose computing robust PLP features using multitapers, and we provide a detailed comparison of different taper weight selections in the Thomson multitaper method in the context of speaker verification. Speaker verification results on the telephone (det5) and microphone (det1, det2, det3 and det4) speech of the NIST 2010 SRE corpus indicate that the multitaper methods outperform the conventional periodogram technique. Weighted averaging of the individual spectral estimates (non-uniform weights) improves performance over simple averaging (uniform weights). Compared to the MFCC and PLP baseline systems, the sine-weighted cepstrum estimator (SWCE) based multitaper method provides average relative reductions in equal error rate (EER) of 12.3% and 7.5%, respectively. For the multi-peak multitaper method, the corresponding reductions are 12.6% and 11.6%. Finally, the Thomson multitaper method provides EER reductions of 9.5% and 5.0% for MFCC and PLP features, respectively. We conclude that both MFCC and PLP features computed via multitapers provide systematic improvements in recognition accuracy.
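
The weighted averaging of tapered spectra described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes closed-form sine tapers in place of the Slepian (DPSS) tapers used by the Thomson method, because sine tapers need no external library. The function names (`sine_tapers`, `power_spectrum`, `multitaper_spectrum`) and the non-uniform weights are hypothetical and chosen only for illustration; they are not the SWCE, multi-peak, or Thomson weights evaluated in the paper.

```python
import cmath
import math


def sine_tapers(frame_len, num_tapers):
    """Return `num_tapers` orthonormal sine tapers of length `frame_len`."""
    scale = math.sqrt(2.0 / (frame_len + 1))
    return [
        [scale * math.sin(math.pi * j * (n + 1) / (frame_len + 1))
         for n in range(frame_len)]
        for j in range(1, num_tapers + 1)
    ]


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a short demo)."""
    n_len = len(frame)
    spectrum = []
    for k in range(n_len // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_len)
                  for n, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted average of the power spectra of the tapered frame.

    With weights=None every taper gets the same weight (simple averaging).
    Weights are normalised to sum to one before use.
    """
    if weights is None:
        weights = [1.0] * len(tapers)
    if len(weights) != len(tapers):
        raise ValueError("need exactly one weight per taper")
    total = sum(weights)
    estimate = [0.0] * (len(frame) // 2 + 1)
    for taper, weight in zip(tapers, weights):
        tapered = [x * h for x, h in zip(frame, taper)]
        for k, power in enumerate(power_spectrum(tapered)):
            estimate[k] += (weight / total) * power
    return estimate


# Demo frame: a pure tone centred on DFT bin 8 of a 64-sample frame.
FRAME_LEN = 64
frame = [math.cos(2.0 * math.pi * 8 * n / FRAME_LEN) for n in range(FRAME_LEN)]
tapers = sine_tapers(FRAME_LEN, num_tapers=4)

# Uniform weights: plain average of the four tapered spectra.
uniform = multitaper_spectrum(frame, tapers)
# Hypothetical decreasing weights, for illustration only.
weighted = multitaper_spectrum(frame, tapers, weights=[4.0, 3.0, 2.0, 1.0])
```

In a feature pipeline, an estimate such as `uniform` or `weighted` would replace the Hamming-windowed periodogram of each frame before the Mel filterbank (MFCC) or the perceptual processing (PLP) is applied. The naive DFT is used only to keep the sketch dependency-free; a real front end would use an FFT.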