Combining localization cues and source model constraints for binaural source separation

Authors:
Ron J. Weiss;Michael I. Mandel;Daniel P. W. Ellis
Affiliations:
LabROSA, Dept. of Electrical Engineering, Columbia University, New York, NY 10027, USA (all three authors)
Venue:
Speech Communication
Year:
2011

Abstract

We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm combines information derived from low-level perceptual cues, similar to those used by the human auditory system, with higher-level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system can separate more sound sources than there are observed channels, and it does so in the presence of reverberation. In simulated mixtures of speech from two and three speakers, the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm that uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation, which enables the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data differ substantially from the training data.
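
The interaural level and phase differences mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names `stft` and `interaural_cues`, the Hann window, and the parameters `n_fft` and `hop` are assumptions chosen for illustration. The sketch computes, for each time-frequency cell, the level difference in dB and the phase difference in radians between the left and right channels. It does not include the probabilistic model, the source prior, or the EM algorithm.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    Window type and sizes are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def interaural_cues(left, right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for each time-frequency cell.

    ILD is 20*log10(|L| / |R|); IPD is the phase of L / R, computed as
    the angle of L * conj(R) to avoid dividing by small values.
    """
    L = stft(left)
    R = stft(right)
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))
    return ild, ipd


# Usage: a right channel that is a half-amplitude copy of the left channel
# should give a constant ILD of 20*log10(2) dB and zero IPD.
rng = np.random.default_rng(0)
left = rng.standard_normal(4000)
right = 0.5 * left
ild, ipd = interaural_cues(left, right)
```

In a separation system of the kind described, cues like these would be the observations that a per-source probabilistic model explains. Here they are only computed, not modeled.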