Audio-visual speaker verification using continuous fused HMMs

  • Authors:
  • David Dean; Sridha Sridharan; Tim Wark

  • Affiliations:
  • Queensland University of Technology; Queensland University of Technology; Queensland University of Technology and CSIRO ICT Centre, Brisbane, Australia

  • Venue:
  • VisHCI '06: Proceedings of the HCSNet workshop on Use of vision in human-computer interaction - Volume 56
  • Year:
  • 2006

Abstract

This paper examines audio-visual speaker verification using a novel adaptation of fused hidden Markov models, in comparison to output fusion of individual classifiers in the audio and video modalities. A comparison of hidden Markov model (HMM) and Gaussian mixture model (GMM) classifiers in each modality under output fusion shows that the choice of audio classifier is more important than the choice of video classifier. Although temporal information allows an HMM to outperform a GMM on video alone, this advantage does not carry through to output fusion with an audio classifier, where the difference between the two video classifiers is minor. An adaptation of fused hidden Markov models, designed to be more robust to within-speaker variation, is then used to show that the temporal relationship between video observations and audio states can be harnessed to reduce errors in audio-visual speaker verification relative to output fusion.
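
The output-fusion baseline combines the scores of independently trained audio and video classifiers. A minimal sketch of such score-level fusion is given below; the weighting parameter `alpha`, the decision threshold, and the example scores are illustrative assumptions rather than values from the paper.

```python
def fuse_scores(audio_llr: float, video_llr: float, alpha: float = 0.7) -> float:
    """Weighted sum of per-modality log-likelihood ratios.

    `alpha` weights the audio classifier; the paper's finding that the audio
    classifier matters more than the video one suggests alpha > 0.5, but the
    value here is purely illustrative.
    """
    return alpha * audio_llr + (1.0 - alpha) * video_llr


def verify(audio_llr: float, video_llr: float, threshold: float = 0.0) -> bool:
    """Accept the claimed identity when the fused score clears the threshold."""
    return fuse_scores(audio_llr, video_llr) > threshold


# Hypothetical scores for a single verification trial.
print(verify(audio_llr=1.8, video_llr=-0.3))  # True: audio evidence dominates
```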
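The fused-HMM approach instead ties the video observations to the hidden states of the audio HMM. The sketch below illustrates one way to score a trial under that coupling: decode the audio observations with Viterbi, then evaluate each time-aligned video observation under a model conditioned on the concurrent audio state. The model interfaces (`audio_loglik`, `video_loglik`) and all parameters are hypothetical; the paper's actual training and adaptation procedure is not reproduced here.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B):
    """Standard Viterbi decode.

    log_pi: (n,) initial state log-probabilities.
    log_A:  (n, n) with log_A[i, j] = log P(state j | state i).
    log_B:  (T, n) per-frame emission log-likelihoods.
    Returns the best state path and its joint log-probability.
    """
    T, n = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    states = np.empty(T, dtype=int)
    states[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states, float(delta.max())


def fused_score(audio_obs, video_obs, log_pi, log_A, audio_loglik, video_loglik):
    """Joint audio-visual score: the best audio state path plus the video
    observations evaluated under per-audio-state video models.

    Assumes audio_obs and video_obs are time-aligned frame sequences.
    """
    log_B = audio_loglik(audio_obs)           # (T, n) audio emission scores
    path, audio_score = viterbi(log_pi, log_A, log_B)
    video_score = sum(video_loglik(v, s) for v, s in zip(video_obs, path))
    return audio_score + video_score
```

In a verification setting, this fused score under the claimed speaker's models would be compared against the score under a background model, just as with the single-modality classifiers.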