Variational Probabilistic Speech Separation Using Microphone Arrays

  • Authors:
  • Steven J. Rennie; Parham Aarabi; Brendan J. Frey

  • Affiliations:
  • Edward S. Rogers Sr. Dept. of Comput. Eng., Univ. of Toronto, Ont.

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2007

Abstract

Separating multiple speech sources using a limited number of noisy sensor measurements is a difficult problem, but one of great practical interest. Although previously introduced source separation methods [such as independent component analysis (ICA)] can be made to work in many situations, most of these methods fail when the sensors are very noisy or when the number of sources exceeds the number of sensors. Our approach to this problem is to combine the multiple sensor likelihoods [obtained using time-delay-of-arrival (TDOA) information] with a generative probability model of the sources. This model accounts for the power spectrum of each source using a mixture model, and accounts for the phase of each source using one discretized hidden phase variable for each frequency. Source separation is achieved by identifying the source vector configuration of maximum a posteriori (MAP) probability, given all available information. An exhaustive search for the MAP configuration is computationally intractable, but we present an efficient variational technique that performs approximate probabilistic inference. For the problem of separating delayed speech mixtures corrupted by additive noise, the algorithm improves upon the signal-to-noise ratio (SNR) gain of existing state-of-the-art probabilistic and TDOA-based speech separation algorithms by over 10 dB. This significant performance improvement is obtained by intelligently combining the information used by these approaches under a representative probabilistic description of the speech production and mixing process. The method is capable of recovering high-fidelity estimates of the underlying speech sources even when there are more sources than microphone observations.
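
The problem setting named in the abstract (delayed source mixtures corrupted by additive noise, evaluated by SNR) can be illustrated with a minimal simulation sketch. This is not the authors' algorithm or model: the function names `mix_delayed` and `snr_db`, the use of integer sample delays, and the white Gaussian noise are all assumptions made for illustration only.

```python
import numpy as np


def mix_delayed(sources, delays, noise_std, rng):
    """Simulate microphone signals as sums of delayed sources plus noise.

    Assumptions (hypothetical, for illustration): delays are non-negative
    integers in samples, and sensor noise is white Gaussian.

    sources: array of shape (n_sources, n_samples)
    delays:  integer array of shape (n_mics, n_sources)
    """
    n_mics, n_sources = delays.shape
    n_samples = sources.shape[1]
    mics = np.zeros((n_mics, n_samples))
    for m in range(n_mics):
        for s in range(n_sources):
            d = int(delays[m, s])
            # Shift source s right by d samples (zero-padded, no wrap-around).
            mics[m, d:] += sources[s, : n_samples - d]
        # Additive sensor noise, independent per microphone.
        mics[m] += noise_std * rng.standard_normal(n_samples)
    return mics


def snr_db(reference, estimate):
    """Signal-to-noise ratio of an estimate against a reference, in dB."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(error**2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three sources observed by two microphones: more sources than sensors.
    sources = rng.standard_normal((3, 16000))
    delays = np.array([[0, 2, 5], [3, 0, 1]])
    mics = mix_delayed(sources, delays, noise_std=0.5, rng=rng)
    # Input SNR for source 0 at microphone 0, where its delay is zero.
    print(snr_db(sources[0], mics[0]))
```

Under these assumptions, an SNR gain would be the SNR of a separated estimate minus the input SNR computed this way; how the estimate itself is obtained is the subject of the paper and is not sketched here.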