An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance using output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch: additive white Gaussian noise for the audio signal and JPEG compression for the visual signal. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the accuracies of the individual audio or visual modality. At high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
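The adaptive fusion described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a margin-based reliability estimate (difference between the best and second-best output scores) and a simple weighted-sum fusion rule; the paper's reliability measure and weighting scheme may differ.

```python
import numpy as np

def reliability(scores):
    """Heuristic reliability estimate from a classifier's output scores:
    the margin between the best and second-best scores. A confident
    (reliable) classifier separates its top candidate clearly."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    return s[0] - s[1]

def fuse(audio_scores, visual_scores):
    """Unsupervised weighted-sum fusion: the audio weight is adapted
    per test sample from the two modalities' reliability estimates,
    so a degraded modality automatically contributes less."""
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    r_a = reliability(audio_scores)
    r_v = reliability(visual_scores)
    w_a = r_a / (r_a + r_v + 1e-12)  # adaptive audio weight in [0, 1]
    fused = w_a * audio_scores + (1.0 - w_a) * visual_scores
    return int(np.argmax(fused)), w_a
```

For example, with noisy audio producing a flat score distribution and clean video producing a peaked one, the adaptive weight shifts toward the visual modality and the fused decision follows the more reliable classifier.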