Audio-Visual speaker identification via adaptive fusion using reliability estimates of both modalities

  • Authors:
  • Niall A. Fox; Brian A. O'Mullane; Richard B. Reilly

  • Affiliations:
  • Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland (all authors)

  • Venue:
  • AVBPA'05 Proceedings of the 5th international conference on Audio- and Video-Based Biometric Person Authentication
  • Year:
  • 2005

Abstract

An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance by taking into account output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities may be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch, using additive white Gaussian noise for the audio and JPEG compression for the visual signals. Experiments are carried out on a large augmented data set derived from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared with the accuracies of the individual audio or visual modalities. For high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
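
The sketch below illustrates the general idea of reliability-weighted score-level fusion described in the abstract. It is a minimal, hypothetical example: the margin-based reliability measure, the weighting rule, and all function names are illustrative assumptions, not the authors' actual formulation from the paper.

```python
import numpy as np

def reliability(scores):
    """Estimate a classifier's reliability from its own output scores.

    Here reliability is the margin between the best and second-best
    normalized scores: a confident classifier separates the top
    candidate clearly from the rest. (Illustrative choice only; the
    paper derives its own score-based reliability estimates.)
    """
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    total = s.sum() or 1.0  # guard against all-zero scores
    return float((s[0] - s[1]) / total)

def fuse_scores(audio_scores, visual_scores):
    """Reliability-weighted fusion of per-speaker scores from two modalities."""
    r_a = reliability(audio_scores)
    r_v = reliability(visual_scores)
    # Adaptive audio weight: the more reliable modality dominates the fusion.
    alpha = r_a / (r_a + r_v) if (r_a + r_v) > 0 else 0.5
    return alpha * np.asarray(audio_scores) + (1 - alpha) * np.asarray(visual_scores)

# Example: three enrolled speakers; degraded audio gives flat scores,
# so the cleaner visual modality receives the larger fusion weight.
audio = np.array([0.40, 0.35, 0.25])
video = np.array([0.70, 0.20, 0.10])
fused = fuse_scores(audio, video)
print("identified speaker:", int(np.argmax(fused)))
```

In this toy setup the audio scores are nearly uniform (low margin, low reliability), so the fused decision is driven mostly by the visual scores, mirroring the behavior the abstract attributes to the adaptive fusion under audio degradation.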