Stereo-based stochastic mapping for robust speech recognition

  • Authors: Mohamed Afify, Xiaodong Cui, Yuqing Gao

  • Affiliations: Orange Lab, Cairo, Egypt; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY

  • Venue: IEEE Transactions on Audio, Speech, and Language Processing
  • Year: 2009

Abstract

We present a stochastic mapping technique for robust speech recognition that uses stereo data. The idea is to construct a Gaussian mixture model for the joint distribution of the clean and noisy features and to use this distribution to predict the clean speech during testing. The proposed mapping is called stereo-based stochastic mapping (SSM). Two different estimators are considered: one is iterative and based on the maximum a posteriori (MAP) criterion, while the other uses the minimum mean square error (MMSE) criterion. The resulting estimators are effectively a mixture of linear transforms weighted by component posteriors, with the parameters of the linear transformations derived from the joint distribution. Compared to the uncompensated baseline, the proposed method yields a 45% relative improvement in word error rate (WER) for digit recognition in the car. In the same setting, SSM outperforms SPLICE and gives results similar to the MMSE compensation of Huang et al. A 66% relative improvement in WER is observed when the mapping is applied in conjunction with multistyle training (MST) for large-vocabulary English speech recognition in a real environment. Also, combining the proposed mapping with CMLLR leads to about a 38% relative improvement in performance compared to CMLLR alone on real field data.
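
For concreteness, the sketch below shows one way the MMSE estimator described in the abstract can be computed from a joint GMM: the clean-speech estimate is a posterior-weighted sum of per-component linear transforms of the noisy feature vector. This is a minimal illustration under stated assumptions, not the paper's implementation; the function name, variable names, and the use of full joint covariances are all illustrative.

```python
# Minimal sketch of the MMSE variant of stereo-based stochastic mapping (SSM).
# Assumes a joint GMM over stacked clean/noisy features z = [x; y] has already
# been trained on stereo data; all names here are hypothetical.
import numpy as np
from scipy.stats import multivariate_normal

def ssm_mmse_estimate(y, weights, means, covs, d):
    """MMSE estimate of the clean feature x given a noisy feature y.

    y       : (d,) noisy feature vector
    weights : (K,) GMM mixture weights
    means   : (K, 2d) joint means, ordered [mu_x; mu_y]
    covs    : (K, 2d, 2d) joint covariances with blocks
              [[S_xx, S_xy], [S_yx, S_yy]]
    d       : dimensionality of the clean feature x
    """
    K = len(weights)
    log_post = np.empty(K)
    cond_means = np.empty((K, d))
    for k in range(K):
        mu_x, mu_y = means[k, :d], means[k, d:]
        S_xy = covs[k, :d, d:]
        S_yy = covs[k, d:, d:]
        # Component posterior p(k | y) from the marginal distribution of y.
        log_post[k] = np.log(weights[k]) + \
            multivariate_normal.logpdf(y, mean=mu_y, cov=S_yy)
        # Per-component linear transform: conditional mean E[x | y, k].
        cond_means[k] = mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y)
    # Normalize posteriors in the log domain for numerical stability.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # MMSE estimate: a mixture of linear transforms weighted by posteriors.
    return post @ cond_means
```

Note that this MMSE form is closed-form and computed in a single pass over the mixture components, whereas the MAP estimator mentioned in the abstract is iterative.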