A Bayesian approach to audio-visual speaker identification

  • Authors:
  • Ara V. Nefian (Microprocessor Research Labs, Intel Corporation); Lu Hong Liang (Microprocessor Research Labs, Intel Corporation); Tieyan Fu (Computer Science and Technology Department, Tsinghua University); Xiao Xing Liu (Microprocessor Research Labs, Intel Corporation)

  • Venue:
  • AVBPA'03: Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication
  • Year:
  • 2003

Abstract

In this paper we describe a text-dependent audio-visual speaker identification approach that combines face recognition with audio-visual speech-based identification. The temporal sequences of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMMs), one for each phoneme-viseme pair and for each person in the database. The use of CHMMs in our system is justified by this model's ability to describe the natural asynchrony between the audio and visual states as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves on the accuracy of both audio-only and video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 dB to 30 dB.
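The decision rule implied by the abstract, combining a per-person audio-visual (CHMM) likelihood with a per-person face (EHMM) likelihood and selecting the best-scoring identity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the weighting scheme, and the toy scores are all assumptions, and the fusion weight `alpha` stands in for whatever stream weighting (possibly SNR-dependent) the system actually uses.

```python
def identify(av_loglik, face_loglik, alpha=0.6):
    """Return the person id maximizing the fused log-likelihood.

    av_loglik, face_loglik: dicts mapping person id -> log-likelihood
        from the audio-visual CHMM and the face EHMM, respectively.
    alpha: weight on the audio-visual stream (hypothetical; in practice
        it could be tuned as a function of the acoustic SNR).
    """
    return max(av_loglik, key=lambda p: alpha * av_loglik[p]
               + (1.0 - alpha) * face_loglik[p])

# Toy example with three enrolled persons (scores are illustrative).
av = {"p1": -120.0, "p2": -95.0, "p3": -110.0}
face = {"p1": -40.0, "p2": -55.0, "p3": -38.0}
print(identify(av, face))  # p2 has the highest fused score here
```

Working in the log domain turns the product of stream likelihoods into a weighted sum, which is the usual way such late-fusion scores are combined and keeps the computation numerically stable.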