Speech separation using speaker-adapted eigenvoice speech models

Authors:
Ron J. Weiss;Daniel P. W. Ellis
Affiliations:
LabROSA, Department of Electrical Engineering, Columbia University, 500 West 120th Street, Room 1300, Mailcode 4712, New York, NY 10027, United States;LabROSA, Department of Electrical Engineering, Columbia University, 500 West 120th Street, Room 1300, Mailcode 4712, New York, NY 10027, United States
Venue:
Computer Speech and Language
Year:
2010

Abstract

We present a system for model-based source separation of single-channel speech mixtures in which the precise source characteristics are not known a priori. The sources are modeled using hidden Markov models (HMMs) and separated using factorial HMM methods. Without prior speaker models for the sources in the mixture, it is difficult to resolve the individual sources exactly, because there is no way to determine which state corresponds to which source at any point in time. The temporal constraints provided by the Markov models mitigate this only to a small extent, and permutations between sources remain a significant problem. We overcome this by adapting the models to match the sources in the mixture. To do so, we represent the space of speaker variation with a parametric signal model based on the eigenvoice technique for rapid speaker adaptation. We present an algorithm to infer the characteristics of the sources present in a mixture, allowing for significantly improved separation performance over that obtained using unadapted source models. The algorithm is evaluated on the task defined in the 2006 Speech Separation Challenge [Cooke, M.P., Lee, T.-W., 2008. The 2006 Speech Separation Challenge. Computer Speech and Language] and compared with separation using source-dependent models. Although performance is not as good as with speaker-dependent models, we show that the system based on model adaptation generalizes better to held-out speakers.
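The eigenvoice idea mentioned in the abstract can be illustrated with a minimal sketch: a speaker's model parameters are approximated as a mean voice plus a weighted combination of a few basis vectors (eigenvoices) learned from training speakers. The code below is an illustration only, not the paper's algorithm. It assumes each speaker is summarized by a single parameter "supervector", uses toy random data, and fits the weights by least squares to a clean target, whereas the paper infers source characteristics from the mixture itself. The function names `train_eigenvoices` and `adapt` are hypothetical.

```python
import numpy as np

def train_eigenvoices(supervectors, num_eigenvoices):
    """PCA over training-speaker supervectors (one row per speaker)."""
    mean_voice = supervectors.mean(axis=0)
    centered = supervectors - mean_voice
    # Rows of vt are orthonormal directions of speaker variation,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvoices = vt[:num_eigenvoices].T  # shape: (dim, num_eigenvoices)
    return mean_voice, eigenvoices

def adapt(mean_voice, eigenvoices, observed):
    """Least-squares weights w so that mean_voice + U @ w approximates observed.

    Assumption for illustration: a clean supervector is observed directly.
    """
    weights, *_ = np.linalg.lstsq(eigenvoices, observed - mean_voice, rcond=None)
    return weights, mean_voice + eigenvoices @ weights

rng = np.random.default_rng(0)
# Toy data: 20 speakers whose 50-dimensional supervectors vary
# along 3 hidden directions around a common offset.
basis = rng.normal(size=(3, 50))
train = rng.normal(size=(20, 3)) @ basis + 5.0

mean_voice, U = train_eigenvoices(train, num_eigenvoices=3)

# A held-out speaker drawn from the same 3-dimensional family.
new_speaker = rng.normal(size=3) @ basis + 5.0
w, adapted = adapt(mean_voice, U, new_speaker)
error = np.linalg.norm(adapted - new_speaker)
```

Because only a handful of weights need to be estimated per speaker, this kind of parameterization can be fitted from little data, which is what makes it attractive for rapid adaptation.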