Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

Authors:
Bertrand Rivet;Laurent Girin;Christian Jutten
Affiliations:
Inst. de la Commun. Parlee, Ecole Nationale d'Electronique et de Radioelectricite, Grenoble;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Citing 0
Cited 5

Multi-modal Speech Processing Methods: An Overview and Future Research Directions Using a MATLAB Based Audio-Visual Toolbox
Multimodal Signals: Cognitive and Algorithmic Issues

Blind Non-stationnary Sources Separation by Sparsity in a Linear Instantaneous Mixture
ICA '09 Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation

Use of bimodal coherence to resolve spectral indeterminacy in Convolutive BSS
LVA/ICA'10 Proceedings of the 9th international conference on Latent variable analysis and signal separation

Use of bimodal coherence to resolve the permutation problem in convolutive BSS
Signal Processing

Multimodal speech separation
NOLISP'09 Proceedings of the 2009 international conference on Advances in Nonlinear Speech Processing

Abstract

Looking at the speaker's face can help a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Separation is performed frequency by frequency by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to resolve the standard source permutation and scale factor ambiguities that arise at each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
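
The two ideas the abstract relies on can be illustrated with a small numerical sketch. It is not the authors' algorithm. It assumes circular convolution and a single DFT over the whole signal, so the per-frequency relation holds exactly; with real room filters and a short-time transform it holds only approximately. The function names, filter length, and random test signals are hypothetical choices made for illustration. The sketch first checks that a 2 × 2 convolutive mixture becomes one memoryless 2 × 2 mixture per frequency bin. It then shows why audio-only separation leaves a permutation and scale ambiguity in each bin, which is the ambiguity the audiovisual model is used to resolve.

```python
import numpy as np


def circular_convolve(h, s):
    """Time-domain circular convolution: y[t] = sum_k h[k] * s[(t - k) mod n]."""
    y = np.zeros(len(s))
    for k, hk in enumerate(h):
        y += hk * np.roll(s, k)
    return y


def mix_time_domain(filters, sources):
    """Convolutive mixture x_i = sum_j h_ij (*) s_j (circular, by assumption)."""
    n_sensors, n_sources, _ = filters.shape
    x = np.zeros((n_sensors, sources.shape[1]))
    for i in range(n_sensors):
        for j in range(n_sources):
            x[i] += circular_convolve(filters[i, j], sources[j])
    return x


def mix_per_frequency(filters, sources):
    """Same mixture computed bin by bin: X(f) = H(f) S(f)."""
    n = sources.shape[1]
    H = np.fft.fft(filters, n, axis=-1)   # (sensors, sources, bins)
    S = np.fft.fft(sources, axis=-1)      # (sources, bins)
    return np.einsum("ijf,jf->if", H, S)


def ambiguity_holds(filters, sources, f, perm, scale):
    """Check that permuted and rescaled sources explain bin f equally well."""
    n = sources.shape[1]
    H_f = np.fft.fft(filters, n, axis=-1)[:, :, f]
    S_f = np.fft.fft(sources, axis=-1)[:, f]
    A = np.diag(scale) @ perm             # permutation followed by scaling
    return np.allclose((H_f @ np.linalg.inv(A)) @ (A @ S_f), H_f @ S_f)


rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 64))    # two hypothetical source signals
filters = rng.standard_normal((2, 2, 8))  # hypothetical 2 x 2 mixing filters

x = mix_time_domain(filters, sources)
print(np.allclose(np.fft.fft(x, axis=-1), mix_per_frequency(filters, sources)))

swap = np.array([[0.0, 1.0], [1.0, 0.0]])
print(ambiguity_holds(filters, sources, f=5, perm=swap, scale=[2.0, -0.5]))
```

Both checks print `True`. Because any swapped and rescaled pair of sources reproduces the observations in a given bin, a separation algorithm run independently per frequency cannot know which output belongs to which speaker, and the outputs must be realigned across bins before the signals are reconstructed.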