This paper examines audio-visual speaker verification using a novel adaptation of fused hidden Markov models, compared against output fusion of individual classifiers in the audio and video modalities. A comparison of hidden Markov model (HMM) and Gaussian mixture model (GMM) classifiers in both modalities under output fusion shows that the choice of audio classifier matters more than the choice of video classifier. Although temporal information allows an HMM to outperform a GMM individually in the video modality, this advantage does not carry through to output fusion with an audio classifier, where the difference between the two video classifiers is minor. An adaptation of fused hidden Markov models, designed to be more robust to within-speaker variation, is used to show that the temporal relationship between video observations and audio states can be harnessed to reduce errors in audio-visual speaker verification relative to output fusion.
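As a point of reference for the output-fusion baseline described above, the following is a minimal sketch of score-level fusion for speaker verification. The function names, the fusion weight, and the decision threshold are all hypothetical illustrations, not the paper's actual configuration; in the paper, the per-modality scores would come from trained HMM or GMM classifiers.

```python
# Illustrative score-level (output) fusion for audio-visual speaker
# verification. Weight and threshold values here are hypothetical;
# real per-modality scores would come from HMM/GMM classifiers.

def fuse_scores(audio_llr: float, video_llr: float,
                audio_weight: float = 0.7) -> float:
    """Weighted sum of per-modality log-likelihood-ratio scores.

    A larger audio_weight reflects the abstract's finding that the
    choice of audio classifier matters more than the video classifier.
    """
    return audio_weight * audio_llr + (1.0 - audio_weight) * video_llr


def verify(audio_llr: float, video_llr: float,
           threshold: float = 0.0) -> bool:
    """Accept the claimed identity if the fused score exceeds the threshold."""
    return fuse_scores(audio_llr, video_llr) > threshold
```

Because each modality is scored independently before fusion, any temporal alignment between audio states and video observations is lost at this stage; the fused HMM approach studied in the paper is designed to retain exactly that relationship.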