A joint particle filter for audio-visual speaker tracking

  • Authors:
  • Kai Nickel, Tobias Gehrig, Rainer Stiefelhagen, John McDonough

  • Affiliation:
  • Universität Karlsruhe (TH), Germany (all authors)

  • Venue:
  • ICMI '05: Proceedings of the 7th International Conference on Multimodal Interfaces
  • Year:
  • 2005

Abstract

In this paper, we present a novel approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, so the complexity of the proposed algorithm is low. We evaluated the system on data recorded during actual lectures. The results of our experiments were an average error of 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
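To make the audio side of the abstract concrete, the following is a minimal sketch of how TDOA-based scoring of 3D particle hypotheses could look. It assumes a PHAT-weighted generalized cross correlation and scores each hypothesized 3D position by reading the correlation curve of every microphone pair at the delay that position would imply; the function names, the fixed speed-of-sound constant, and the simple averaging are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, a room-temperature assumption


def gcc_phat(sig_a, sig_b, fs, max_tau):
    """PHAT-weighted generalized cross correlation between two mic signals.

    Returns the estimated delay of sig_a relative to sig_b (seconds) and the
    correlation curve restricted to lags within +/- max_tau."""
    n = len(sig_a) + len(sig_b)
    spec = np.fft.rfft(sig_a, n=n) * np.conj(np.fft.rfft(sig_b, n=n))
    spec /= np.abs(spec) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(spec, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))  # center lag 0
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs
    return delay, cc


def expected_tdoa(pos, mic_a, mic_b):
    """TDOA (seconds) a source at 3D position `pos` would produce for a mic pair."""
    return (np.linalg.norm(pos - mic_a) - np.linalg.norm(pos - mic_b)) / SPEED_OF_SOUND


def audio_score(pos, mic_pairs, cc_curves, fs):
    """Score one particle's 3D position: look up each pair's GCC curve at the
    lag implied by that position and average (higher = more plausible)."""
    score = 0.0
    for (mic_a, mic_b), cc in zip(mic_pairs, cc_curves):
        center = len(cc) // 2
        lag = int(round(expected_tdoa(pos, mic_a, mic_b) * fs))
        score += cc[np.clip(center + lag, 0, len(cc) - 1)]
    return score / len(mic_pairs)
```

In a full tracker of the kind described above, this audio score would be combined with the video-based scores (foreground, face, and upper-body detections evaluated at each particle's projected image position) to weight and resample the particles at every time step.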