Rapid speaker adaptation in latent speaker space with non-negative matrix factorization

Authors:
Xueru Zhang;Kris Demuynck;Hugo Van Hamme
Venue:
Speech Communication
Year:
2013

Abstract

A novel speaker adaptation algorithm based on Gaussian mixture weight adaptation is described. A small number of latent speaker vectors are estimated with non-negative matrix factorization (NMF). These latent vectors encode the distinctive, systematic patterns of Gaussian usage observed when modeling the individual speakers that make up the training data. Expressing the speaker-dependent Gaussian mixture weights as a linear combination of a small number of latent vectors reduces the number of parameters that must be estimated from the enrollment data. The resulting fast adaptation algorithm, using only 3 s of enrollment data, achieves performance similar to that of fMLLR adapted on more than 100 s of data. In order to learn richer Gaussian usage patterns from the training data, the NMF-based weight adaptation is combined with vocal tract length normalization (VTLN) and speaker adaptive training (SAT), or with a simple Gaussian exponentiation scheme that lowers the dynamic range of the Gaussian likelihoods. Evaluation on the Wall Street Journal tasks shows a 5% relative word error rate (WER) reduction over the speaker-independent recognition system, which already incorporates VTLN. The WER can be lowered further by combining weight adaptation with Gaussian mean adaptation by means of eigenvoice speaker adaptation.
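The core idea of the abstract, expressing a new speaker's mixture weights as a non-negative combination of a few latent vectors, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single weight vector per speaker (a real recognizer has weights per HMM state), uses the standard KL-divergence multiplicative updates for NMF, and runs on synthetic data. The names `nmf_kl`, `adapt_weights`, and all sizes are hypothetical.

```python
import numpy as np

def nmf_kl(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factor V (n_gauss x n_speakers) as W @ H with KL multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    # Make each latent vector (column of W) sum to 1 and rescale H to match.
    scale = W.sum(axis=0)
    W /= scale
    H *= scale[:, None]
    return W, H

def adapt_weights(W, counts, n_iter=100, eps=1e-12):
    """Estimate only k activations for a new speaker, keeping W fixed."""
    k = W.shape[1]
    h = np.full(k, counts.sum() / k)
    for _ in range(n_iter):
        wh = W @ h + eps
        h *= (W.T @ (counts / wh)) / (W.sum(axis=0) + eps)
    w = W @ h
    return w / w.sum()  # adapted mixture weights, normalized to sum to 1

# Synthetic stand-in for training data (assumption: no real speech involved).
rng = np.random.default_rng(1)
n_gauss, n_speakers, k = 50, 20, 4
true_W = rng.dirichlet(np.ones(n_gauss), size=k).T      # (n_gauss, k)
true_H = rng.dirichlet(np.ones(k), size=n_speakers).T   # (k, n_speakers)
V = true_W @ true_H                                     # per-speaker weights

W, H = nmf_kl(V, k)

# A short enrollment is mimicked by a small number of Gaussian occupancy counts.
p = V[:, 0] / V[:, 0].sum()
counts = rng.multinomial(300, p).astype(float)
adapted = adapt_weights(W, counts)
```

Only `k` activation values are estimated per speaker instead of `n_gauss` weights, which is why so little enrollment data can suffice in this kind of scheme.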