Mask classification for missing-feature reconstruction for robust speech recognition in unknown background noise

  • Authors:
  • Wooil Kim; Richard M. Stern

  • Affiliations:
  • Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, Department of Electrical Engineering, University of Texas at Dallas, 2601 N. Floyd Road, EC33, Ric ...; Department of Electrical and Computer Engineering and Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

"Missing-feature" techniques to improve speech recognition accuracy are based on the blind determination of which cells in a spectrogram-like display of speech are corrupted by noise or other types of disturbance (and hence are "missing"). In this paper we present three new approaches that improve the speech recognition accuracy obtained using missing-feature techniques. Previous studies (e.g. Seltzer et al., 2004) found that Bayesian approaches to missing-feature classification are effective in ameliorating the effects of various types of additive noise. While Seltzer et al. primarily used white noise to train their Bayesian classifier, we have found that white noise is not the best training signal when noise with greater spectral and/or temporal variation is encountered in the testing environment. The first innovation introduced in this paper, referred to as frequency-dependent classification, classifies each of the frequency bands in which the incoming speech is analyzed independently, using parallel sets of frequency-dependent features. The second innovation, referred to as colored-noise generation using multi-band partitioning, trains the Bayesian classifier on masking noises with artificially introduced spectral and temporal variation; this classifier determines which spectro-temporal components of incoming speech are corrupted by noise in unknown testing environments. The third innovation is an adaptive method to estimate the a priori values of the mask classifier, which determines whether a particular time-frequency segment of the test data should be considered reliable. These innovations are shown to improve speech recognition accuracy on a small-vocabulary test when missing-feature restoration is applied to incoming speech corrupted by additive noise of an unknown nature, especially at lower signal-to-noise ratios.
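As a rough illustration of how a frequency-dependent Bayesian mask classifier with adaptively estimated priors might be organized, the sketch below keeps a separate pair of class models for each frequency band and re-estimates each band's prior from the test data. It is not the authors' implementation. The single scalar feature per cell, the Gaussian class-conditional models, the averaging rule used to update the priors, and all function and variable names are assumptions made for this example.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian (assumed class-conditional model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def posterior_reliable(x, band_model, prior_reliable):
    """P(reliable | x) for one time-frequency cell, using that band's own models."""
    log_r = math.log(prior_reliable) + gaussian_logpdf(x, *band_model["reliable"])
    log_u = math.log(1.0 - prior_reliable) + gaussian_logpdf(x, *band_model["unreliable"])
    top = max(log_r, log_u)  # subtract the max for numerical stability
    p_r = math.exp(log_r - top)
    p_u = math.exp(log_u - top)
    return p_r / (p_r + p_u)


def adapt_priors(features, models, n_iter=10, init=0.5, floor=1e-3):
    """Re-estimate each band's prior as the mean posterior over the utterance.

    This EM-style update is an assumed stand-in for the adaptive prior
    estimation mentioned in the abstract, not the paper's actual rule.
    """
    priors = [init] * len(models)
    for _ in range(n_iter):
        for b, model in enumerate(models):
            posts = [posterior_reliable(frame[b], model, priors[b]) for frame in features]
            mean_post = sum(posts) / len(posts)
            # Keep the prior away from 0 and 1 so its logarithm stays finite.
            priors[b] = min(max(mean_post, floor), 1.0 - floor)
    return priors


def classify_masks(features, models, priors):
    """Binary mask: True where a cell is judged reliable, decided band by band."""
    return [
        [posterior_reliable(x, models[b], priors[b]) > 0.5 for b, x in enumerate(frame)]
        for frame in features
    ]
```

Here `features[t][b]` is a hypothetical scalar feature for frame `t` and band `b`, and `models[b]` holds `(mean, variance)` pairs for the two classes in that band. In practice the class models would be trained offline on speech corrupted by the training noise, and the classifier would use richer feature vectors than a single scalar.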