Time-frequency correlation-based missing-feature reconstruction for robust speech recognition in band-restricted conditions

Authors:
Wooil Kim;John H. L. Hansen
Affiliations:
Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX;Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Abstract

Band-limited speech represents one of the most challenging factors for robust speech recognition. This is especially true in spoken document retrieval, where audio corpora drawn from sources with a wide range of recording conditions require effective automatic speech recognition. The missing-feature reconstruction method is problematic when applied to band-limited speech, since it assumes that the observations in the unreliable regions are always greater than the latent original clean speech. To mitigate this problem, the approach developed here calculates the posterior probability using only the reliable components. This study proposes an advanced method that exploits the correlation of spectral components across the time and frequency axes to improve missing-feature reconstruction in band-limited conditions. We employ an F1 Area Window and a Cutoff Border Window to incorporate more information from reliable components that are highly correlated with the cutoff frequency band. To detect the cutoff regions for missing-feature reconstruction, we also present a blind mask estimation scheme that employs a synthesized band-limited speech model and requires no secondary training data. The proposed methods are evaluated using the SPHINX3 speech recognition engine and the TIMIT corpus. Experimental results demonstrate that the proposed time-frequency (TF) correlation-based missing-feature reconstruction method significantly improves band-limited speech recognition accuracy. By employing the proposed TF missing-feature reconstruction method, we obtain up to a 14.61% average relative improvement in word error rate (WER) across four available bandwidths with cutoff frequencies of 1.0, 1.5, 2.0, and 2.5 kHz, compared to earlier formulated methods. Experimental results on the National Gallery of the Spoken Word (NGSW) corpus also show that the proposed method is effective in improving band-limited speech recognition in real-life spoken document conditions.
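
The idea of computing the posterior probability from reliable components only can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a clean-speech Gaussian mixture model with diagonal covariances over log-spectral channels, and the function names (band_limit_mask, reconstruct_band_limited), the channel center frequencies, and all parameter names are hypothetical. The TF correlation windows and the blind mask estimation described in the abstract are not modeled here; the mask is simply derived from a known cutoff frequency.

```python
import numpy as np


def band_limit_mask(center_freqs_hz, cutoff_hz):
    """Return True for channels at or below the cutoff (treated as reliable).

    Assumption: each spectral channel is described by one center frequency.
    """
    return np.asarray(center_freqs_hz, dtype=float) <= cutoff_hz


def reconstruct_band_limited(y, reliable, weights, means, variances):
    """Fill unreliable channels of one log-spectral frame.

    y         : (D,) observed log-spectral vector
    reliable  : (D,) boolean mask, True where the observation is trusted
    weights   : (K,) mixture weights of a clean-speech GMM (assumed given)
    means     : (K, D) component means
    variances : (K, D) diagonal component variances
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(reliable, dtype=bool)

    # Component log-likelihoods marginalized to the reliable channels only.
    diff = y[r] - means[:, r]
    log_lik = -0.5 * np.sum(
        diff ** 2 / variances[:, r] + np.log(2.0 * np.pi * variances[:, r]),
        axis=1,
    )

    # Posterior over components, normalized in a numerically stable way.
    log_post = np.log(weights) + log_lik
    log_post -= np.max(log_post)
    post = np.exp(log_post)
    post /= post.sum()

    # With diagonal covariances, the conditional mean of a missing channel
    # given a component is that component's mean, so the estimate is the
    # posterior-weighted average of the component means. No upper bound
    # from the observation is applied to the estimate.
    x_hat = y.copy()
    x_hat[~r] = post @ means[:, ~r]
    return x_hat, post
```

In this sketch the unreliable observations never enter the posterior, so the low energy left in the cutoff band cannot pull the estimate toward the wrong mixture component.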