An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance using output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch: additive white Gaussian noise for the audio signal and JPEG compression for the visual signal. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the accuracies of the individual audio or visual modality. At high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
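The adaptive fusion described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a margin-based reliability estimate (difference between the best and second-best output scores) and a simple weighted-sum fusion rule; the paper's reliability measure and weighting scheme may differ.

```python
import numpy as np

def reliability(scores):
    """Heuristic reliability estimate from a classifier's output scores:
    the margin between the best and second-best scores. A confident
    (reliable) classifier separates its top candidate clearly."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    return s[0] - s[1]

def fuse(audio_scores, visual_scores):
    """Unsupervised weighted-sum fusion: the audio weight is adapted
    per test sample from the two modalities' reliability estimates,
    so a degraded modality automatically contributes less."""
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    r_a = reliability(audio_scores)
    r_v = reliability(visual_scores)
    w_a = r_a / (r_a + r_v + 1e-12)  # adaptive audio weight in [0, 1]
    fused = w_a * audio_scores + (1.0 - w_a) * visual_scores
    return int(np.argmax(fused)), w_a
```

For example, with noisy audio producing a flat score distribution and clean video producing a peaked one, the adaptive weight shifts toward the visual modality and the fused decision follows the more reliable classifier.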