DBN based models for audio-visual speech analysis and recognition

  • Authors:
  • Ilse Ravyse;Dongmei Jiang;Xiaoyue Jiang;Guoyun Lv;Yunshu Hou;Hichem Sahli;Rongchun Zhao

  • Affiliations:
  • Department ETRO, Joint Research Group on Audio Visual Signal Processing (AVSP), Vrije Universiteit Brussel, Brussel;School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China;School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China;School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China;School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China;Department ETRO, Joint Research Group on Audio Visual Signal Processing (AVSP), Vrije Universiteit Brussel, Brussel;School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China

  • Venue:
  • PCM'06 Proceedings of the 7th Pacific Rim conference on Advances in Multimedia Information Processing
  • Year:
  • 2006

Abstract

We present an audio-visual automatic speech recognition system that significantly improves recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network-based recognition module. The visual module locates and tracks the speaker's head and mouth movements and extracts relevant visual speech features, represented by lip contour information and 3D deformations of lip movements. The acoustic module extracts noise-robust features, i.e. the Mel Filterbank Cepstrum Coefficients (MFCCs). Finally, we propose two models based on Dynamic Bayesian Networks (DBN) that either treat the audio and video streams separately or integrate the features from both streams. We also compare the proposed DBN-based system with a classical Hidden Markov Model. The novelty of the developed framework is the preservation of the audio-visual speech signal characteristics from the feature-extraction step through the learning step. Experiments on continuous audio-visual speech show that the segmentation boundaries of phones in the audio stream and of visemes in the video stream are close to manual segmentation boundaries.
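As a rough illustration of the pipeline described above, the sketch below shows how the acoustic module's MFCC features could be extracted and combined frame by frame with lip features from the visual module. This is a minimal sketch and not the authors' implementation: the `librosa` library, the frame parameters, the upsampling by repetition, and the `lip_features` placeholder are all assumptions; the paper's DBN models keep the two streams as separate, coupled variables rather than concatenating them.

```python
# Minimal sketch (not the authors' implementation) of audio feature extraction
# and frame-level audio-visual fusion. Assumes numpy and librosa are installed;
# the lip-feature array is a hypothetical stand-in for the visual module output.
import numpy as np
import librosa


def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Acoustic module sketch: one n_mfcc-dim MFCC vector per 10 ms frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),        # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr),   # 10 ms frame shift (assumed)
    )
    return mfcc.T  # shape: (n_audio_frames, n_mfcc)


def fuse_streams(mfcc, lip_features):
    """Concatenate audio and visual observations on a shared time axis.

    lip_features: (n_video_frames, d) array of lip contour / 3D deformation
    parameters (hypothetical here), typically at a lower rate than the MFCCs.
    """
    # Upsample the video stream by repetition so both streams are frame-aligned.
    factor = int(np.ceil(mfcc.shape[0] / lip_features.shape[0]))
    lips_up = np.repeat(lip_features, factor, axis=0)[: mfcc.shape[0]]
    return np.hstack([mfcc, lips_up])  # (n_audio_frames, n_mfcc + d)
```

For example, 13-dim MFCCs fused with 6-dim lip parameters yield 19-dim observation vectors that could feed a single-stream recognizer such as the HMM baseline mentioned in the abstract; the proposed DBN models instead retain the per-stream characteristics throughout learning.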