Improved audio-visual speaker recognition via the use of a hybrid combination strategy

Authors:
Simon Lucey;Tsuhan Chen
Affiliations:
Advanced Multimedia Processing Laboratory, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA;Advanced Multimedia Processing Laboratory, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA
Venue:
AVBPA'03 Proceedings of the 4th international conference on Audio- and video-based biometric person authentication
Year:
2003

Abstract

This paper presents an in-depth analysis of effective strategies for integrating the audio and visual modalities for text-dependent speaker recognition. Our work is based on the well-known hidden Markov model (HMM) classifier framework for modelling speech. We propose a framework that handles the mismatch between train and test observation sets, so as to provide effective classifier combination between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, emerge naturally depending on the influence of the mismatch. Assuming that poor performance in most audio-visual speaker recognition applications can be attributed to train/test mismatches, we propose that the main aim of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end, we recommend a strategy, based on theory and empirical evidence, that uses a hybrid of the weighted product and weighted sum rules in the presence of varying acoustic noise. Results are presented on the M2VTS database.
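
The two combination rules named in the abstract can be illustrated with a minimal sketch. It assumes each modality's HMM classifier has already produced a posterior distribution over the enrolled speakers. The function names, the single weight `alpha`, and the example numbers are hypothetical and are not taken from the paper. The hybrid rule is not sketched, since the abstract does not give its form.

```python
import numpy as np


def normalize(p):
    """Rescale a non-negative score vector so that it sums to one."""
    p = np.asarray(p, dtype=float)
    return p / p.sum()


def weighted_sum(p_audio, p_video, alpha):
    """Weighted sum rule: alpha * P_audio + (1 - alpha) * P_video."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    return normalize(alpha * p_audio + (1.0 - alpha) * p_video)


def weighted_product(p_audio, p_video, alpha, eps=1e-12):
    """Weighted product rule: P_audio**alpha * P_video**(1 - alpha), renormalized.

    Computed in the log domain; eps guards against log(0).
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    log_p = alpha * np.log(p_audio + eps) + (1.0 - alpha) * np.log(p_video + eps)
    log_p -= log_p.max()  # shift for numerical stability before exponentiating
    return normalize(np.exp(log_p))


# Hypothetical posteriors over three enrolled speakers (illustrative values only).
p_audio = [0.01, 0.90, 0.09]  # acoustic classifier, e.g. under noise
p_video = [0.60, 0.10, 0.30]  # visual classifier
alpha = 0.5                   # assumed weight on the acoustic modality

fused_sum = weighted_sum(p_audio, p_video, alpha)
fused_prod = weighted_product(p_audio, p_video, alpha)
print("weighted sum:    ", fused_sum)
print("weighted product:", fused_prod)
```

In this sketch, `alpha = 1` reduces both rules to the acoustic classifier alone and `alpha = 0` to the visual classifier alone. For intermediate weights, the product rule lets a near-zero score from either modality suppress a speaker almost entirely, whereas the sum rule only averages the two scores.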