Visual voice activity detection as a help for speech source separation from convolutive mixtures

  • Authors:
  • Bertrand Rivet; Laurent Girin; Christian Jutten

  • Affiliations:
  • Institut de la Communication Parlée (ICP), CNRS UMR 5009, INPG, Université Stendhal, Grenoble, France and Laboratoire des Images et des Signaux (LIS), CNRS UMR 5083, INPG, Université ...
  • Institut de la Communication Parlée (ICP), CNRS UMR 5009, INPG, Université Stendhal, Grenoble, France
  • Laboratoire des Images et des Signaux (LIS), CNRS UMR 5083, INPG, Université Joseph Fourier, Grenoble, France

  • Venue:
  • Speech Communication
  • Year:
  • 2007

Abstract

Audio-visual speech source separation consists in combining visual speech processing techniques (e.g., lip parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: visual information is used as a voice activity detector (VAD), which is combined with a new geometric separation method. The proposed audio-visual method is shown to efficiently extract a real spontaneous speech utterance in the difficult case of convolutive mixtures, even when the competing sources are highly non-stationary. Typical gains of 18-20 dB in signal-to-interference ratio are obtained for a wide range of (2x2) and (3x3) mixtures. Moreover, the overall process is computationally much simpler than previously proposed audio-visual separation schemes.
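The core idea of a visual VAD, as the abstract describes it, is that silence periods of the target speaker can be detected purely from video: when the lips barely move, the speaker is most likely not talking, and those frames can then feed a sparseness-based separation stage. The following Python sketch is a minimal, hypothetical illustration of that first step only (it is not the authors' method); the function name `visual_vad`, the threshold value, and the synthetic lip trajectory are all assumptions for demonstration.

```python
import numpy as np

def visual_vad(lip_params, threshold=0.1, win=5):
    """Toy visual voice activity detector (illustrative only).

    lip_params : (T, D) array of lip geometry parameters, one row per
                 video frame (e.g., inner lip width and height).
    Returns a boolean array of length T: True where the speaker is
    likely talking. Frames with little lip motion are labelled silence.
    """
    # Frame-to-frame variation of the lip parameters
    motion = np.linalg.norm(np.diff(lip_params, axis=0), axis=1)
    motion = np.concatenate([[0.0], motion])
    # Moving-average smoothing to avoid spurious speech/silence toggling
    kernel = np.ones(win) / win
    smoothed = np.convolve(motion, kernel, mode="same")
    return smoothed > threshold

# Hypothetical usage: 100 frames with lip motion only in the middle
rng = np.random.default_rng(0)
lips = np.zeros((100, 2))
lips[40:60] += rng.normal(0.0, 1.0, (20, 2))  # simulated articulation
vad = visual_vad(lips)
```

In the paper's setting, the silence frames flagged by such a detector would then be exploited by the geometric separation stage, since intervals where the target source is inactive constrain the convolutive mixing estimate.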