Audio-video biometric recognition for non-collaborative access granting

  • Authors:
  • Christian Micheloni; Sergio Canazza; Gian Luca Foresti

  • Affiliations:
  • Christian Micheloni: Department of Computer Science, University of Udine, Via delle Scienze 206, 33100 Udine, Italy
  • Sergio Canazza: Department of Historical and Documentary Sciences, University of Udine, Via Petracco 8, 33100 Udine, Italy
  • Gian Luca Foresti: Department of Computer Science, University of Udine, Via delle Scienze 206, 33100 Udine, Italy

  • Venue:
  • Journal of Visual Languages and Computing
  • Year:
  • 2009


Abstract

In this paper, the problem of non-collaborative person identification for secure access to facilities is addressed. The proposed solution combines face recognition and speaker recognition techniques, and their integration improves performance with respect to either classifier alone. In non-collaborative scenarios, face recognition first requires detecting the face pattern and then recognizing it, even in non-frontal poses. In the current work, histogram normalization, a boosting technique, and linear discriminant analysis are exploited to address typical problems such as illumination variability, occlusions, and pose variation. In addition, a new temporal classification scheme is proposed to improve the robustness of the frame-by-frame classification; it allows known still-image classification techniques to be carried into a multi-frame context in which the captured scene is dynamic. For the audio, a method for automatic speaker identification in noisy environments is presented. In particular, we optimize a speech de-noising algorithm to improve the performance of the extended Kalman filter (EKF). As a baseline system to integrate with the proposed de-noising algorithm, we use a conventional speaker recognition system based on Gaussian mixture models (GMMs) with mel-frequency cepstral coefficients (MFCCs) as features. To confirm the effectiveness of our methods, we performed video and speaker recognition tasks first separately and then with the results integrated. In particular, two sets of corpora have been used: (a) public corpora (ELSDSR for audio and FERET for images) and (b) a dedicated audio/video corpus in which the speakers read a list of sentences while wearing a scarf or a full-face motorcycle helmet. Experimental results show that our methods significantly reduce the classification error rate.
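
The following is a minimal sketch of the conventional GMM-MFCC speaker identification baseline mentioned in the abstract, assuming Python with librosa and scikit-learn. The function names, the parameter values (13 MFCCs, 16 diagonal-covariance mixture components), and the overall structure are illustrative assumptions, not the authors' implementation, and the EKF de-noising front end is omitted.

    # Illustrative GMM-MFCC speaker identification baseline (not the authors' code).
    # Assumes librosa and scikit-learn are installed; parameter values are
    # placeholders chosen for readability, not taken from the paper.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
        """Load an utterance and return its MFCC frames as an (n_frames, n_mfcc) array."""
        signal, _ = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T  # one row per short-time analysis frame

    def train_speaker_models(train_utterances, n_components=16):
        """Fit one diagonal-covariance GMM per speaker on its pooled MFCC frames.

        train_utterances: dict mapping speaker id -> list of wav file paths.
        """
        models = {}
        for speaker, paths in train_utterances.items():
            frames = np.vstack([extract_mfcc(p) for p in paths])
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", max_iter=200)
            gmm.fit(frames)
            models[speaker] = gmm
        return models

    def identify_speaker(models, wav_path):
        """Return the enrolled speaker whose GMM gives the highest average log-likelihood."""
        frames = extract_mfcc(wav_path)
        scores = {spk: gmm.score(frames) for spk, gmm in models.items()}
        return max(scores, key=scores.get)

In the pipeline described above, such frame-level scores would be computed on the de-noised speech rather than the raw noisy signal, and the resulting identification would then be combined with the video-based classification; the exact de-noising optimization and fusion rule are described in the paper itself.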