Speech separation by efficient combinatorial decoding of speech mixtures

  • Authors:
  • Manuel Reyes-Gomez; Nebojsa Jojic

  • Affiliations:
  • MSN Applied Research, Redmond, WA; Microsoft Research, Redmond, WA

  • Venue:
  • ICME '09: Proceedings of the 2009 IEEE International Conference on Multimedia and Expo
  • Year:
  • 2009

Abstract

We formulate the cocktail party problem as the minimization of a symmetric posimodular function defined on fragments of the signal captured by a single microphone. This formulation allows the application of tractable combinatorial optimization techniques, in particular Queyranne's algorithm [1], to exactly solve a problem that was previously considered exponential in the size of the signal and was typically addressed by greedy search or posterior-distribution approximations. While the main idea described in the paper may be applicable to a variety of signal segmentation problems (e.g., image or video segmentation), we focus here on unsupervised separation of sources in mixed speech signals recorded by a single microphone. As the optimization criterion we use the likelihood under a generative model which assumes that each time-frequency bin is assigned to one of the two speakers, and that each speaker's utterance has been generated from the same generic speech model. (This assumption has previously been motivated by the sparsity of the time-frequency representation, which makes it unlikely that more than one speaker dominates any given time-frequency bin.) The partition of the time-frequency space that maximizes the likelihood under the model is the one for which the resulting decoded speech of the two independent sources has the highest combined likelihood. The exact search over all possible assignments of the time-frequency bins to the two speakers is performed in polynomial time. Further speedups are achievable by presegmenting the spectrogram into a large number of small segments that do not violate the deformable spectrogram model [2]. We show that this technique leads to blind separation of mixed signals in which the two speakers have identical spectral characteristics, opening up a variety of possible applications in teleconferencing and telephony.
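
The polynomial-time exact search rests on Queyranne's pendent-pair procedure for minimizing a symmetric submodular set function over nonempty proper subsets [1]. The sketch below is not the authors' implementation; it is a minimal Python illustration in which `cost` is a hypothetical user-supplied objective (for the task above, one could imagine the negative combined log-likelihood of decoding a candidate set of time-frequency segments and its complement with the generic speech model) and `elements` stands for the presegmented spectrogram fragments [2].

```python
def queyranne(elements, cost):
    """Pendent-pair minimization (Queyranne) of a symmetric set function
    `cost` over nonempty proper subsets of `elements`.
    Uses O(n^3) evaluations of `cost` in total."""
    groups = [frozenset([e]) for e in elements]  # contracted ground set

    def f(group_list):
        # Evaluate `cost` on the union of the original elements in these groups.
        return cost(frozenset().union(*group_list))

    best_set, best_val = None, float("inf")
    while len(groups) > 1:
        # Maximum-adjacency style ordering: repeatedly append the group u
        # that minimizes f(prefix + [u]) - f([u]).
        order, remaining = [groups[0]], groups[1:]
        while remaining:
            u = min(remaining, key=lambda g: f(order + [g]) - f([g]))
            remaining.remove(u)
            order.append(u)
        # The last two groups of the ordering form a pendent pair; the final
        # group, taken on its own, is a candidate minimizer.
        t, u = order[-2], order[-1]
        if f([u]) < best_val:
            best_set, best_val = set(u), f([u])
        # Contract the pendent pair and repeat on the smaller ground set.
        groups = [g for g in groups if g is not t and g is not u] + [t | u]
    return best_set, best_val


# Toy check on a 4-node chain graph cut (a symmetric submodular function):
# the cheapest split is across the weight-1 edge.
edges = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 3.0}

def cut(S):
    return sum(w for (x, y), w in edges.items() if (x in S) != (y in S))

print(queyranne(["a", "b", "c", "d"], cut))  # e.g. ({'c', 'd'}, 1.0)
```

The toy cut function only exercises the mechanics; in the separation setting each evaluation of `cost` would involve decoding both hypothesized sources against the speech model, which is why presegmenting the spectrogram to reduce the number of elements yields the further speedups mentioned above.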