Audio-Visual feature fusion for speaker identification

  • Authors:
  • Noor Almaadeed;Amar Aggoun;Abbes Amira

  • Affiliations:
  • Department of Computer Engineering, Brunel University, London, UK;Department of Computer Engineering, Brunel University, London, UK;NIBEC, University of Ulster, Jordanstown, UK, College of Engineering, Qatar University, Qatar

  • Venue:
  • ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part I
  • Year:
  • 2012

Abstract

Analyses of facial and audio features have been considered separately in conventional speaker identification systems. Herein, we propose a robust algorithm for text-independent speaker identification based on decision-level and feature-level fusion of facial and audio features. The suggested approach uses Mel-frequency cepstral coefficients (MFCCs) for audio signal processing, the Viola-Jones Haar cascade algorithm for face detection from video, and eigenface features (EFF) and Gaussian mixture models (GMMs) for feature-level and decision-level fusion of audio and video. Decision-level fusion is carried out using PCA for face and GMM for audio through AND voting. Feature-level fusion is investigated by combining MFCC (audio) and PCA (face) features to construct a hybrid GMM for each speaker. Testing on GRID, a multi-speaker audio-visual database, shows that the decision-level fusion of PCA (face) and GMM (audio) achieves 98.2% accuracy and is almost 15% more efficient than feature-level fusion.
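
The two fusion strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "AND voting" means an identity is accepted only when the audio and face classifiers independently pick the same speaker, and that feature-level fusion amounts to concatenating an MFCC vector with a PCA (eigenface) projection before a per-speaker GMM is trained. The function names, the score dictionaries, and the numeric values are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def best_speaker(scores: Dict[str, float]) -> str:
    """Return the speaker ID with the highest score.

    Scores are assumed to be "higher is better", e.g. a GMM
    log-likelihood for audio or a negated eigenface distance for face.
    """
    return max(scores, key=scores.get)


def and_vote(audio_scores: Dict[str, float],
             face_scores: Dict[str, float]) -> Optional[str]:
    """Decision-level fusion (assumed reading of AND voting).

    Each modality votes for its top speaker; the identity is accepted
    only if both votes agree, otherwise the sample is rejected (None).
    """
    audio_id = best_speaker(audio_scores)
    face_id = best_speaker(face_scores)
    return audio_id if audio_id == face_id else None


def fuse_features(mfcc: Sequence[float],
                  face_projection: Sequence[float]) -> List[float]:
    """Feature-level fusion (assumed to be plain concatenation).

    The combined vector would then be used to train one hybrid GMM
    per speaker; that training step is omitted here.
    """
    return list(mfcc) + list(face_projection)


# Hypothetical scores for two enrolled speakers.
audio = {"s1": -41.2, "s2": -55.0}   # audio GMM log-likelihoods
face = {"s1": -0.8, "s2": -2.3}      # negated eigenface distances
decision = and_vote(audio, face)     # both modalities pick "s1"
hybrid = fuse_features([12.1, -3.4], [0.25, 0.75])
```

Under this reading, AND voting trades coverage for reliability: a disagreement between modalities yields a rejection rather than a guess, whereas the concatenated feature vector always produces a single decision from the hybrid model.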