Visual voice activity detection as a help for speech source separation from convolutive mixtures

  • Authors:
  • Bertrand Rivet; Laurent Girin; Christian Jutten

  • Affiliations:
  • Institut de la Communication Parlée (ICP), CNRS UMR 5009, INPG, Université Stendhal, Grenoble, France and Laboratoire des Images et des Signaux (LIS), CNRS UMR 5083, INPG, Université ...
  • Institut de la Communication Parlée (ICP), CNRS UMR 5009, INPG, Université Stendhal, Grenoble, France
  • Laboratoire des Images et des Signaux (LIS), CNRS UMR 5083, INPG, Université Joseph Fourier, Grenoble, France

  • Venue:
  • Speech Communication
  • Year:
  • 2007

Abstract

Audio-visual speech source separation consists in combining visual speech processing techniques (e.g., lip parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: visual information is used as a voice activity detector (VAD), which is combined with a new geometric separation method. The proposed audio-visual method is shown to efficiently extract a real spontaneous speech utterance in the difficult case of convolutive mixtures, even when the competing sources are highly non-stationary. Typical gains of 18-20 dB in signal-to-interference ratio are obtained for a wide range of (2x2) and (3x3) mixtures. Moreover, the overall process is computationally much simpler than previously proposed audio-visual separation schemes.
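The core idea of a visual VAD, as the abstract describes it, is that silence periods of the target speaker can be detected purely from video: when the lips barely move, the speaker is most likely not talking, and those frames can then feed a sparseness-based separation stage. The following Python sketch is a minimal, hypothetical illustration of that first step only (it is not the authors' method); the function name `visual_vad`, the threshold value, and the synthetic lip trajectory are all assumptions for demonstration.

```python
import numpy as np

def visual_vad(lip_params, threshold=0.1, win=5):
    """Toy visual voice activity detector (illustrative only).

    lip_params : (T, D) array of lip geometry parameters, one row per
                 video frame (e.g., inner lip width and height).
    Returns a boolean array of length T: True where the speaker is
    likely talking. Frames with little lip motion are labelled silence.
    """
    # Frame-to-frame variation of the lip parameters
    motion = np.linalg.norm(np.diff(lip_params, axis=0), axis=1)
    motion = np.concatenate([[0.0], motion])
    # Moving-average smoothing to avoid spurious speech/silence toggling
    kernel = np.ones(win) / win
    smoothed = np.convolve(motion, kernel, mode="same")
    return smoothed > threshold

# Hypothetical usage: 100 frames with lip motion only in the middle
rng = np.random.default_rng(0)
lips = np.zeros((100, 2))
lips[40:60] += rng.normal(0.0, 1.0, (20, 2))  # simulated articulation
vad = visual_vad(lips)
```

In the paper's setting, the silence frames flagged by such a detector would then be exploited by the geometric separation stage, since intervals where the target source is inactive constrain the convolutive mixing estimate.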