Speech enhancement using transient speech components

  • Authors:
  • J. R. Boston;Charturong (Paul) Tantibundhit

  • Affiliations:
  • University of Pittsburgh;University of Pittsburgh

  • Venue:
  • Speech enhancement using transient speech components
  • Year:
  • 2006

Abstract

We believe that the auditory system, like the visual system, may be sensitive to abrupt stimulus changes, and that the transient component in speech may be particularly critical to speech perception. If this component can be identified and selectively amplified, improved speech perception in background noise may be possible. This project describes a method to decompose speech into tonal, transient, and residual components. The modified discrete cosine transform (MDCT) and the wavelet transform were used to capture tonal and transient features in speech. The tonal and transient components were identified using a small number of MDCT and wavelet coefficients, respectively. In previous studies, all of the MDCT and wavelet coefficients were assumed to be independent, and the significant coefficients of each transform were identified by thresholding. However, an appropriate threshold is not known, and the MDCT and wavelet coefficients show statistical dependencies, described by the clustering and persistence properties. In this work, the hidden Markov chain (HMC) model and the hidden Markov tree (HMT) model were applied to describe the clustering and persistence properties between the MDCT coefficients and between the wavelet coefficients. The MDCT coefficients in each frequency index were modeled as a two-state mixture of two univariate Gaussian distributions. The wavelet coefficients in each scale of each tree were modeled in the same way, as a two-state mixture of two univariate Gaussian distributions. The initial parameters of the Gaussian mixtures were estimated by the greedy EM algorithm. The Viterbi and MAP algorithms were used to find the optimal state distribution, so the significant MDCT and wavelet coefficients were determined without relying on a threshold.
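The threshold-free selection can be illustrated for the chain case with a minimal sketch. It is not the dissertation's implementation: it assumes zero-mean Gaussians with a small-variance state (0, insignificant) and a large-variance state (1, significant), and the function name `viterbi_two_state` and all parameter values are hypothetical. The HMT case and the greedy EM initialization are not covered.

```python
import numpy as np

def viterbi_two_state(coeffs, sigmas, trans, init):
    """Most likely state path for a two-state hidden Markov chain.

    Assumption (not from the source): each state emits a zero-mean
    Gaussian with standard deviation sigmas[state].
    """
    coeffs = np.asarray(coeffs, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Log-likelihood of every coefficient under each state, shape (T, 2).
    log_emit = (-0.5 * np.log(2.0 * np.pi * sigmas**2)
                - 0.5 * (coeffs[:, None] / sigmas) ** 2)
    log_trans = np.log(np.asarray(trans, dtype=float))
    n = len(coeffs)
    score = np.empty((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(np.asarray(init, dtype=float)) + log_emit[0]
    for t in range(1, n):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    states = np.empty(n, dtype=int)
    states[-1] = np.argmax(score[-1])
    for t in range(n - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states

# Hypothetical coefficients: a burst of large values between small ones.
coeffs = [0.1, -0.2, 3.0, -2.5, 2.8, 0.05, -0.1]
states = viterbi_two_state(coeffs,
                           sigmas=[0.3, 3.0],
                           trans=[[0.9, 0.1], [0.1, 0.9]],
                           init=[0.5, 0.5])
print(states.tolist())  # -> [0, 0, 1, 1, 1, 0, 0]
```

Coefficients labeled 1 would be the ones kept as significant. The transition matrix favors staying in the same state, which is one simple way to express the clustering property along the chain.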
The transient component isolated by our method was selectively amplified and recombined with the original speech to generate enhanced speech, with its energy adjusted to equal that of the original speech. The intelligibility of the original and enhanced speech was evaluated in eleven human subjects using the modified rhyme protocol. Word recognition rates show that the enhanced speech can improve speech intelligibility at low SNR levels (8% at -15 dB, 14% at -20 dB, and 18% at -25 dB).*

*This dissertation is a compound document (it contains both a paper copy and a CD). The CD requires Windows Media Player or RealPlayer.
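The recombination step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `enhance` and the value of `gain` are hypothetical (the abstract does not give the amplification factor), and the transient is assumed to be already isolated and time-aligned with the original.

```python
import numpy as np

def enhance(original, transient, gain):
    """Add an amplified transient to the original and match the original energy.

    `gain` is a hypothetical amplification factor; the source does not state one.
    """
    original = np.asarray(original, dtype=float)
    transient = np.asarray(transient, dtype=float)
    mixed = original + gain * transient
    # Rescale so the enhanced signal has the same total energy as the original.
    scale = np.sqrt(np.sum(original**2) / np.sum(mixed**2))
    return scale * mixed

# Toy signals standing in for speech and its isolated transient component.
original = np.array([0.2, -0.1, 0.9, -0.8, 0.1, 0.05])
transient = np.array([0.0, 0.0, 0.7, -0.6, 0.0, 0.0])
enhanced = enhance(original, transient, gain=4.0)
```

Because the energy is held fixed, amplifying the transient raises its share of the signal rather than the overall level.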