Missing-feature reconstruction by leveraging temporal spectral correlation for robust speech recognition in background noise conditions

Authors:
Wooil Kim;John H. L. Hansen
Affiliations:
Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX;Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

This paper proposes a novel missing-feature reconstruction method to improve speech recognition in background noise environments. The existing missing-feature reconstruction method exploits log-spectral correlation across frequency bands. We propose a temporal spectral feature analysis that improves reconstruction performance by leveraging temporal correlation across neighboring frames. As in the conventional method, a Gaussian mixture model is trained on the resulting temporal spectral feature set. The final estimates for missing-feature reconstruction are obtained by selectively combining the original frequency-correlation-based method and the proposed temporal-correlation-based method. The proposed method is evaluated on the TIMIT speech corpus under various types of background noise and on the CU-Move in-vehicle speech corpus. Experimental results demonstrate that the proposed method is more effective than the original frequency-correlation-based method at improving speech recognition performance in adverse conditions. With the proposed temporal-frequency-based reconstruction method, a 17.71% average relative improvement in word error rate (WER) is obtained for white, car, speech babble, and background music noise conditions across SNRs of 5, 10, and 15 dB, compared to the original frequency-correlation-based method. A 16.72% relative improvement is also obtained in real-life in-vehicle conditions using data from the CU-Move corpus.
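
To make the idea of Gaussian-mixture-based reconstruction concrete, here is a minimal sketch of a generic conditional-mean estimate for the unreliable components of a feature vector. It is not the authors' implementation: the abstract does not specify the estimator, the form of the temporal feature vector, or the rule for combining the two estimates. The function names `reconstruct` and `gaussian_logpdf`, the full-covariance GMM parameters, and the reliability mask are all assumptions made for illustration. The same routine could in principle be applied to a vector of log-spectral values across frequency bands (frequency correlation) or to a vector of values from neighboring frames (temporal correlation).

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian with full covariance."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.shape[0] * np.log(2.0 * np.pi) + logdet + maha)


def reconstruct(x, reliable, weights, means, covs):
    """Fill unreliable entries of x with a GMM conditional-mean estimate.

    x        : (D,) feature vector; unreliable entries are ignored
    reliable : (D,) boolean mask, True where x is trusted
    weights  : (K,) mixture weights (hypothetical, pre-trained)
    means    : (K, D) component means
    covs     : (K, D, D) full component covariances
    """
    x = np.asarray(x, dtype=float)
    o = np.asarray(reliable, dtype=bool)
    m = ~o
    out = x.copy()
    if not m.any():
        return out  # nothing to reconstruct
    if not o.any():
        # No reliable evidence: fall back to the prior mean of the GMM.
        out[m] = np.asarray(weights) @ np.asarray(means)[:, m]
        return out

    num_comp = len(weights)
    log_post = np.empty(num_comp)
    cond_means = np.empty((num_comp, int(m.sum())))
    for k in range(num_comp):
        mu_o, mu_m = means[k][o], means[k][m]
        cov_oo = covs[k][np.ix_(o, o)]
        cov_mo = covs[k][np.ix_(m, o)]
        # Component posterior is computed from the reliable part only.
        log_post[k] = np.log(weights[k]) + gaussian_logpdf(x[o], mu_o, cov_oo)
        # Conditional mean of the missing part given the reliable part.
        cond_means[k] = mu_m + cov_mo @ np.linalg.solve(cov_oo, x[o] - mu_o)

    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    post /= post.sum()
    out[m] = post @ cond_means  # posterior-weighted average over components
    return out


# Toy usage: one component, two correlated dimensions, second one missing.
weights = np.array([1.0])
means = np.array([[0.0, 0.0]])
covs = np.array([[[1.0, 0.5], [0.5, 1.0]]])
estimate = reconstruct(np.array([2.0, np.nan]), [True, False],
                       weights, means, covs)
```

In this toy case the missing value is pulled toward the reliable one in proportion to their correlation. A practical system would also need a reliability mask estimator and, for noisy speech, would typically constrain the estimate using the observed noisy value, neither of which is shown here.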