Issues with uncertainty decoding for noise robust automatic speech recognition

Authors:
H. Liao; M. J. F. Gales
Affiliations:
Engineering Department, Cambridge University, Trumpington Street, Cambridge CB2 1PZ, United Kingdom;Engineering Department, Cambridge University, Trumpington Street, Cambridge CB2 1PZ, United Kingdom
Venue:
Speech Communication
Year:
2008


Abstract

Interest continues in a class of robustness algorithms for speech recognition that exploit the notion of uncertainty introduced by environmental noise. These techniques share the property that the uncertainty varies with the noise level and is propagated to the decoding stage, where it increases the model variances. In observation uncertainty forms, the uncertainty variance is simply the variance of the enhancement error, and it is added to the model variances. Another form, called uncertainty decoding, refers to a factorisation that yields a linear feature transform and a model variance bias that increases with noise. With appropriate approximations, efficient implementations may be obtained, with the goal of achieving performance close to that of model-based compensation without the associated computational cost. Unfortunately, uncertainty decoding forms that compute the uncertainty in the front-end and pass it to the decoder may suffer from a theoretical problem in low signal-to-noise ratio conditions. This report discusses how this fundamental issue arises and demonstrates it through two schemes: SPLICE with uncertainty and front-end joint uncertainty decoding (FE-Joint). A method to mitigate the problem for FE-Joint compensation is presented, along with an explanation of how SPLICE implicitly addresses it. It is also shown that a model-based joint uncertainty decoding approach, unlike these front-end forms, does not suffer from this limitation and is more computationally attractive. The issues described and the performance of the various schemes are examined on two artificially corrupted corpora: the AURORA 2.0 digit string recognition task and the 1000-word Resource Management task.
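
The "linear feature transform and model variance bias" described in the abstract can be illustrated with a minimal scalar sketch. It is not the paper's implementation. The assumed score has the form log|a| + log N(a*y + b; mean, var + var_bias), and the function names, the two component means, and the numeric values are all hypothetical. The sketch suggests one way to read the low signal-to-noise ratio issue: when a single variance bias, shared by every model component, grows large, the components' scores converge and the decoder loses the ability to discriminate between them.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ud_log_likelihood(y, mean, var, a=1.0, b=0.0, var_bias=0.0):
    """Illustrative uncertainty-decoding style score for one feature.

    Assumed form (not taken from the paper's equations):
        log|a| + log N(a*y + b; mean, var + var_bias)
    where (a, b) is a linear feature transform and var_bias is the
    variance bias added to the model variance.
    """
    return math.log(abs(a)) + log_gauss(a * y + b, mean, var + var_bias)


def component_log_ratio(y, var_bias):
    """Score difference between two hypothetical model components
    that share the same front-end variance bias."""
    ll_0 = ud_log_likelihood(y, mean=-1.0, var=1.0, var_bias=var_bias)
    ll_1 = ud_log_likelihood(y, mean=2.0, var=1.0, var_bias=var_bias)
    return ll_1 - ll_0


if __name__ == "__main__":
    # Larger bias stands in for lower SNR in this toy setting.
    for var_bias in (0.0, 1.0, 10.0, 100.0):
        ratio = component_log_ratio(1.5, var_bias)
        print(f"var_bias={var_bias:6.1f}  log-ratio={ratio:.4f}")
```

With equal model variances the log-ratio reduces to 3 / (1 + var_bias) for this observation, so it decays toward zero as the shared bias grows. A model-based scheme would instead use a bias that differs across components, which this sketch does not model.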