Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer

  • Authors:
  • Marc Delcroix; Shinji Watanabe; Tomohiro Nakatani; Atsushi Nakamura

  • Affiliations:
  • NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikaridai, Seika-cho, Souraku-gun, Kyoto 619-0237, Japan (all authors)

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2013

Abstract

A conventional approach to noise-robust speech recognition is to employ a speech enhancement pre-processor before recognition. However, such a pre-processor usually introduces artifacts that limit the achievable improvement in recognition performance. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework builds on recent proposals that increase robustness by replacing the point estimate of the enhanced features with a distribution that has a dynamic (i.e., time-varying) feature variance. We have recently proposed a model of the dynamic feature variance that consists of a dynamic feature variance root, obtained from the pre-processor, multiplied by a weight representing the pre-processor uncertainty; the uncertainty weight is optimized using adaptation data. The formulation of the method is general and can be used with any speech enhancement pre-processor. However, we observed that with noise reduction based on spectral subtraction or related approaches, adaptation can fail because the proposed model poorly represents the actual dynamic feature variance. The dynamic feature variance changes with the level of the speech sound, which varies across the HMM states. We therefore propose improving the model by introducing HMM state dependency. We achieve this with a cluster-based representation: the Gaussians of the acoustic model are grouped into clusters, and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks demonstrate the generality of the proposed integration scheme and show that the proposed extension improves performance with various speech enhancement pre-processors.
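
To make the integration scheme concrete, the sketch below illustrates the core likelihood computation described in the abstract: each acoustic-model Gaussian's variance is inflated by the pre-processor's dynamic variance root scaled by the uncertainty weight of that Gaussian's cluster. This is a minimal illustration under assumptions not stated in the abstract (diagonal covariances, a toy two-cluster grouping); the function names and data are hypothetical, not the paper's notation.

```python
import numpy as np

def log_likelihood(y, mean, var_model, var_root, weight):
    """Gaussian log-likelihood with a dynamically inflated variance.

    y          : (D,) enhanced feature vector for one frame
    mean       : (D,) Gaussian mean from the acoustic model
    var_model  : (D,) Gaussian variance from the acoustic model
    var_root   : (D,) dynamic feature variance root from the pre-processor
    weight     : scalar uncertainty weight for this Gaussian's cluster

    The effective variance is the model variance plus the weighted dynamic
    variance, so frames the pre-processor is uncertain about contribute
    flatter (less decisive) likelihoods.
    """
    var_eff = var_model + weight * var_root
    diff = y - mean
    return -0.5 * np.sum(np.log(2.0 * np.pi * var_eff) + diff**2 / var_eff)

def score_frame(y, var_root, gaussians, cluster_of, weights):
    """Score one frame against all Gaussians with per-cluster weights.

    gaussians  : list of (mean, var_model) pairs
    cluster_of : cluster index for each Gaussian (e.g., from grouping the
                 acoustic-model Gaussians by k-means or a regression tree)
    weights    : per-cluster uncertainty weights
    """
    return np.array([
        log_likelihood(y, mean, var, var_root, weights[cluster_of[g]])
        for g, (mean, var) in enumerate(gaussians)
    ])

# Toy usage: 4 Gaussians in 2 clusters, 13-dimensional features.
rng = np.random.default_rng(0)
D = 13
gaussians = [(rng.normal(size=D), rng.uniform(0.5, 1.5, size=D))
             for _ in range(4)]
cluster_of = [0, 0, 1, 1]                  # e.g., low- vs. high-energy states
weights = np.array([0.2, 1.0])             # per-cluster uncertainty weights
y = rng.normal(size=D)                     # enhanced features for one frame
var_root = rng.uniform(0.1, 0.3, size=D)   # dynamic variance root, same frame
print(score_frame(y, var_root, gaussians, cluster_of, weights))
```

Per the abstract, the per-cluster weights are optimized on adaptation data; a simple stand-in here would be choosing `weights` to maximize the total `score_frame` likelihood over adaptation frames, though the paper's actual estimation procedure is not reproduced in this sketch.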