Audio-Visual feature fusion for speaker identification

Authors:
Noor Almaadeed; Amar Aggoun; Abbes Amira
Affiliations:
Department of Computer Engineering, Brunel University, London, UK; Department of Computer Engineering, Brunel University, London, UK; NIBEC, University of Ulster, Jordanstown, UK, and College of Engineering, Qatar University, Qatar
Venue:
ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part I
Year:
2012

Abstract

Conventional speaker identification systems analyze facial and audio features separately. Herein, we propose a robust algorithm for text-independent speaker identification based on decision-level and feature-level fusion of facial and audio features. The approach uses Mel-frequency cepstral coefficients (MFCCs) for audio signal processing, the Viola-Jones Haar cascade algorithm for face detection from video, and eigenface features (EFF) and Gaussian mixture models (GMMs) for feature-level and decision-level fusion of audio and video. Decision-level fusion combines PCA for face and GMM for audio through AND voting. Feature-level fusion is investigated by combining MFCC (audio) and PCA (face) features to construct a hybrid GMM for each speaker. Testing on GRID, a multi-speaker audio-visual database, shows that decision-level fusion of PCA (face) and GMM (audio) achieves 98.2% accuracy and is almost 15% more efficient than feature-level fusion.
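
The abstract names the two fusion strategies but does not spell out their mechanics. The following is a minimal sketch, not the authors' implementation. It assumes that AND voting means an identity is accepted only when the face and audio classifiers pick the same speaker, and that feature-level fusion means concatenating the two feature vectors before modelling. The function names `and_vote` and `fuse_features`, and the convention that a higher score means a better match, are hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def and_vote(face_scores: Sequence[float],
             audio_scores: Sequence[float]) -> Optional[int]:
    """Decision-level fusion by AND voting (assumed interpretation).

    Each sequence holds one score per enrolled speaker, higher = better,
    e.g. negative eigenface distance for face and GMM log-likelihood
    for audio. The speaker is accepted only if both modalities agree.
    """
    face_id = argmax(face_scores)
    audio_id = argmax(audio_scores)
    return face_id if face_id == audio_id else None


def fuse_features(mfcc_frame: Sequence[float],
                  face_projection: Sequence[float]) -> List[float]:
    """Feature-level fusion (assumed interpretation): concatenate the
    audio and face vectors into one hybrid vector for a per-speaker model."""
    return list(mfcc_frame) + list(face_projection)


if __name__ == "__main__":
    # Both modalities favour speaker 1, so the decision is accepted.
    print(and_vote([0.1, 0.9, 0.3], [-50.0, -10.0, -30.0]))
    # The modalities disagree, so the decision is rejected (None).
    print(and_vote([0.9, 0.1], [-50.0, -10.0]))
    print(fuse_features([1.0, 2.0], [3.0]))
```

Under this reading, decision-level fusion keeps the two classifiers independent and only compares their outputs, whereas feature-level fusion requires a single model to handle features of different scale and dimensionality.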