In this paper, we explore the generalization capability of acoustic models for improving speech recognition robustness against noise distortions. While generalization in statistical learning theory originally refers to a model's ability to perform well on unseen test data drawn from the same distribution as the training data, we show that good generalization capability is also desirable in mismatched cases. One way to obtain such general models is to use a margin-based model training method, such as soft-margin estimation (SME), which tolerates some acoustic mismatch without detailed knowledge of the distortion mechanisms by enlarging the margins between competing models. Experimental results on the Aurora-2 and Aurora-3 connected digit string recognition tasks demonstrate that, by improving the model's generalization capability through SME training, speech recognition performance can be significantly improved in both matched and low-to-medium mismatched testing cases with no language model constraints. Recognition results show that SME performs better with mean and variance normalization than without it, and therefore provides a complementary benefit to conventional feature normalization techniques, so the two can be combined to further improve system performance. Although this study focuses on noisy speech recognition, we believe the proposed margin-based learning framework can be extended to deal with other types of distortion and with robustness issues in other machine learning applications.
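The feature normalization the abstract combines with SME is per-utterance mean and variance normalization (MVN) of the cepstral features. A minimal sketch of that step, assuming a NumPy feature matrix of shape `(num_frames, num_coeffs)` (the function name and `eps` guard are illustrative, not from the paper):

```python
import numpy as np

def mean_variance_normalize(features, eps=1e-8):
    """Per-utterance mean and variance normalization (MVN).

    Each cepstral coefficient track (a column of the
    (num_frames, num_coeffs) matrix) is shifted to zero mean and
    scaled to unit variance, reducing channel and noise-induced
    shifts before model training or decoding.
    """
    mean = features.mean(axis=0)          # per-coefficient mean over frames
    std = features.std(axis=0)            # per-coefficient standard deviation
    return (features - mean) / (std + eps)  # eps avoids division by zero
```

In practice this is applied independently to every training and test utterance, so both sides of a train/test mismatch are mapped toward the same zero-mean, unit-variance range; margin-based training such as SME then operates on these normalized features.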