In this paper we study the performance of low-variance multi-taper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. Such a single-taper spectrum estimate has large variance, which can be reduced by averaging spectral estimates obtained with a set of different tapers, giving a so-called multi-taper spectral estimate. Multi-taper spectrum estimation has proven powerful, especially when the spectrum of interest has a large dynamic range or varies rapidly. Multi-taper MFCC features were recently studied in speaker verification with promising preliminary results. Our primary goal in this study is to validate those findings using an up-to-date i-vector classifier on the latest NIST 2010 SRE data. In addition, we propose computing robust PLP features using multi-tapers. Furthermore, we provide a detailed comparison of different taper weight selections in the Thomson multi-taper method in the context of speaker verification. Speaker verification results on the telephone (det5) and microphone (det1, det2, det3 and det4) speech of the NIST 2010 SRE corpus indicate that the multi-taper methods outperform the conventional periodogram technique. Weighted averaging of the individual spectral estimates (non-uniform weights) improves performance over simple averaging (uniform weights). Compared to the MFCC and PLP baseline systems, the sine-weighted cepstrum estimator (SWCE) based multi-taper method provides average relative reductions in equal error rate (EER) of 12.3% and 7.5%, respectively. For the multi-peak multi-taper method, the corresponding reductions are 12.6% and 11.6%. Finally, the Thomson multi-taper method provides error reductions of 9.5% and 5.0% in EER for MFCC and PLP features, respectively. We conclude that both MFCC and PLP features computed via multi-tapers provide systematic improvements in recognition accuracy.
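The variance-reduction idea described in the abstract (averaging spectral estimates obtained with several tapers instead of using a single Hamming-windowed periodogram) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the choice of closed-form sine tapers, and the default uniform weights are assumptions made for illustration only. The SWCE, multi-peak and Thomson weightings compared in the paper are not reproduced here; any non-uniform weights would be passed in through the hypothetical `weights` argument.

```python
import numpy as np


def sine_tapers(frame_len, num_tapers):
    """Return a (num_tapers, frame_len) array of orthonormal sine tapers.

    Assumption: closed-form sine tapers are used because they need no
    eigen-decomposition; the paper also considers other taper families.
    """
    n = np.arange(1, frame_len + 1)
    j = np.arange(1, num_tapers + 1)[:, None]
    scale = np.sqrt(2.0 / (frame_len + 1))
    return scale * np.sin(np.pi * j * n / (frame_len + 1))


def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted average of the per-taper power spectra of one frame.

    weights=None means uniform weights (simple averaging). Non-uniform
    weights can be supplied, but none are defined in this sketch.
    """
    num_tapers = tapers.shape[0]
    if weights is None:
        weights = np.full(num_tapers, 1.0 / num_tapers)
    weights = np.asarray(weights, dtype=float)
    # One power spectrum per taper: shape (num_tapers, frame_len // 2 + 1).
    eigenspectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ eigenspectra


def hamming_periodogram(frame):
    """Single-taper baseline: Hamming-windowed periodogram of one frame."""
    window = np.hamming(len(frame))
    window = window / np.linalg.norm(window)  # unit energy, like the tapers
    return np.abs(np.fft.rfft(window * frame)) ** 2
```

A quick way to see the effect is to compute both estimates on many frames of white noise and compare the frame-to-frame spread of each frequency bin: the multi-taper estimate should fluctuate noticeably less than the single-taper one. In a feature pipeline, the resulting spectrum would take the place of the periodogram before the Mel filterbank (MFCC) or the PLP analysis.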