Joint speaker and environment adaptation using TensorVoice for robust speech recognition

Authors:
Yongwon Jeong
Affiliations:
-
Venue:
Speech Communication
Year:
2014

Abstract

We present a method for adapting a hidden Markov model (HMM)-based automatic speech recognition system to a target speaker and noise environment. Given HMMs built from various speakers and noise conditions, we use a tensor decomposition to build tensorvoices that capture the interaction between speaker and noise. We express the updated model for the target speaker and noise environment as a product of the tensorvoices and two weight vectors, one for the speaker and one for the noise. We present an iterative algorithm that determines the weight vectors in the maximum likelihood (ML) framework. Because the weight vectors are separate, the tensorvoice approach can adapt to the target speaker and noise environment differentially, whereas the eigenvoice approach, which is based on a matrix decomposition, cannot adapt to the two factors differentially. In supervised adaptation tests on the AURORA4 corpus, the tensorvoice method achieves a relative performance improvement over the eigenvoice method of approximately 10% on average for 6-24 s of adaptation data, and a relative improvement over the maximum likelihood linear regression (MLLR) method of approximately 5.4% on average for 6-18 s of adaptation data. The tensorvoice approach is therefore an efficient method for speaker and noise adaptation.
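
To make the bilinear structure described in the abstract concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes each training HMM is summarized by a single mean supervector, that the bases come from a truncated HOSVD-style projection of the speaker and noise modes, and that the ML estimation is replaced by alternating least squares against a target supervector (which would correspond to ML only under identity covariances and fully observed means). The names `build_bases`, `synthesize`, and `fit_weights` are hypothetical and chosen for illustration.

```python
import numpy as np


def build_bases(means, k_spk, k_noise):
    """Build illustrative bases from a (speakers, noises, dim) array of mean supervectors.

    Assumption: a truncated HOSVD-style projection stands in for the paper's
    tensor decomposition.
    """
    n_spk, n_noise, _ = means.shape
    # Unfold along the speaker mode and the noise mode, then take left singular vectors.
    u_spk, _, _ = np.linalg.svd(means.reshape(n_spk, -1), full_matrices=False)
    noise_unfold = np.moveaxis(means, 1, 0).reshape(n_noise, -1)
    u_noise, _, _ = np.linalg.svd(noise_unfold, full_matrices=False)
    u_spk, u_noise = u_spk[:, :k_spk], u_noise[:, :k_noise]
    # Project the speaker and noise modes; the feature mode is left intact.
    return np.einsum("snd,si,nj->ijd", means, u_spk, u_noise)


def synthesize(bases, w_spk, w_noise):
    """Adapted mean = bases contracted with one speaker and one noise weight vector."""
    return np.einsum("ijd,i,j->d", bases, w_spk, w_noise)


def fit_weights(bases, target, n_iter=20):
    """Alternate between the two weight vectors (least-squares stand-in for ML)."""
    k_spk, k_noise, _ = bases.shape
    w_spk, w_noise = np.ones(k_spk), np.ones(k_noise)
    errors = []
    for _ in range(n_iter):
        # With the noise weights fixed, the model is linear in the speaker weights.
        a = np.einsum("ijd,j->di", bases, w_noise)
        w_spk = np.linalg.lstsq(a, target, rcond=None)[0]
        # With the speaker weights fixed, the model is linear in the noise weights.
        b = np.einsum("ijd,i->dj", bases, w_spk)
        w_noise = np.linalg.lstsq(b, target, rcond=None)[0]
        errors.append(float(np.linalg.norm(synthesize(bases, w_spk, w_noise) - target)))
    return w_spk, w_noise, errors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_means = rng.normal(size=(6, 4, 30))  # synthetic data, not AURORA4
    bases = build_bases(train_means, k_spk=3, k_noise=2)
    target = train_means[2, 1] + 0.1 * rng.normal(size=30)
    w_spk, w_noise, errors = fit_weights(bases, target)
    print("residual after first and last iteration:", errors[0], errors[-1])
```

Each half-step solves an exact least-squares problem for one weight vector while the other is held fixed, so the residual cannot increase from one iteration to the next. An eigenvoice-style model, by contrast, would use a single weight vector over bases from a matrix decomposition, leaving no separate handle on the speaker and noise factors.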