A Bayesian approach to audio-visual speaker identification

  • Authors:
  • Ara V. Nefian (Microprocessor Research Labs, Intel Corporation); Lu Hong Liang (Microprocessor Research Labs, Intel Corporation); Tieyan Fu (Computer Science and Technology Department, Tsinghua University); Xiao Xing Liu (Microprocessor Research Labs, Intel Corporation)

  • Venue:
  • AVBPA'03: Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication
  • Year:
  • 2003

Abstract

In this paper we describe a text-dependent audio-visual speaker identification approach that combines face recognition with audio-visual speech-based identification. The temporal sequences of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMMs), one for each phoneme-viseme pair and for each person in the database. The use of CHMMs in our system is justified by this model's ability to describe the natural asynchrony between the audio and visual states as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves on the accuracy of both audio-only and video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 dB to 30 dB.
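The decision rule implied by the abstract, combining a per-person audio-visual (CHMM) likelihood with a per-person face (EHMM) likelihood and selecting the best-scoring identity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the weighting scheme, and the toy scores are all assumptions, and the fusion weight `alpha` stands in for whatever stream weighting (possibly SNR-dependent) the system actually uses.

```python
def identify(av_loglik, face_loglik, alpha=0.6):
    """Return the person id maximizing the fused log-likelihood.

    av_loglik, face_loglik: dicts mapping person id -> log-likelihood
        from the audio-visual CHMM and the face EHMM, respectively.
    alpha: weight on the audio-visual stream (hypothetical; in practice
        it could be tuned as a function of the acoustic SNR).
    """
    return max(av_loglik, key=lambda p: alpha * av_loglik[p]
               + (1.0 - alpha) * face_loglik[p])

# Toy example with three enrolled persons (scores are illustrative).
av = {"p1": -120.0, "p2": -95.0, "p3": -110.0}
face = {"p1": -40.0, "p2": -55.0, "p3": -38.0}
print(identify(av, face))  # p2 has the highest fused score here
```

Working in the log domain turns the product of stream likelihoods into a weighted sum, which is the usual way such late-fusion scores are combined and keeps the computation numerically stable.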