A joint particle filter for audio-visual speaker tracking

  • Authors:
  • Kai Nickel, Tobias Gehrig, Rainer Stiefelhagen, John McDonough

  • Affiliation:
  • Universität Karlsruhe (TH), Germany (all authors)

  • Venue:
  • ICMI '05: Proceedings of the 7th International Conference on Multimodal Interfaces
  • Year:
  • 2005

Abstract

In this paper, we present a novel approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, so the complexity of the proposed algorithm is low. We evaluated the system on data recorded during actual lectures. The results of our experiments were an average error of 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
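To make the audio side of the abstract concrete, the following is a minimal sketch of how TDOA-based scoring of 3D particle hypotheses could look. It assumes a PHAT-weighted generalized cross correlation and scores each hypothesized 3D position by reading the correlation curve of every microphone pair at the delay that position would imply; the function names, the fixed speed-of-sound constant, and the simple averaging are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, a room-temperature assumption


def gcc_phat(sig_a, sig_b, fs, max_tau):
    """PHAT-weighted generalized cross correlation between two mic signals.

    Returns the estimated delay of sig_a relative to sig_b (seconds) and the
    correlation curve restricted to lags within +/- max_tau."""
    n = len(sig_a) + len(sig_b)
    spec = np.fft.rfft(sig_a, n=n) * np.conj(np.fft.rfft(sig_b, n=n))
    spec /= np.abs(spec) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(spec, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))  # center lag 0
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs
    return delay, cc


def expected_tdoa(pos, mic_a, mic_b):
    """TDOA (seconds) a source at 3D position `pos` would produce for a mic pair."""
    return (np.linalg.norm(pos - mic_a) - np.linalg.norm(pos - mic_b)) / SPEED_OF_SOUND


def audio_score(pos, mic_pairs, cc_curves, fs):
    """Score one particle's 3D position: look up each pair's GCC curve at the
    lag implied by that position and average (higher = more plausible)."""
    score = 0.0
    for (mic_a, mic_b), cc in zip(mic_pairs, cc_curves):
        center = len(cc) // 2
        lag = int(round(expected_tdoa(pos, mic_a, mic_b) * fs))
        score += cc[np.clip(center + lag, 0, len(cc) - 1)]
    return score / len(mic_pairs)
```

In a full tracker of the kind described above, this audio score would be combined with the video-based scores (foreground, face, and upper-body detections evaluated at each particle's projected image position) to weight and resample the particles at every time step.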