Pfinder: Real-Time Tracking of the Human Body
IEEE Transactions on Pattern Analysis and Machine Intelligence
CONDENSATION—Conditional Density Propagation for Visual Tracking
International Journal of Computer Vision
Automating camera management for lecture room environments
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Color-Based Probabilistic Tracking
ECCV '02 Proceedings of the 7th European Conference on Computer Vision-Part I
Towards Vision-Based 3-D People Tracking in a Smart Room
ICMI '02 Proceedings of the 4th IEEE International Conference on Multimodal Interfaces
A framework for speech source localization using sensor arrays
Joint audio-visual tracking using particle filters
EURASIP Journal on Applied Signal Processing
Kalman filters for time delay of arrival-based source localization
EURASIP Journal on Applied Signal Processing
Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech
Proceedings of the 8th international conference on Multimodal interfaces
Audio-visual perception of a lecturer in a smart seminar room
Signal Processing - Special section: Multimodal human-computer interfaces
Audiovisual head orientation estimation with particle filtering in multisensor scenarios
EURASIP Journal on Advances in Signal Processing
Head Orientation Estimation Using Particle Filtering in Multiview Scenarios
Multimodal Technologies for Perception of Humans
Audio-Visual Clustering for 3D Speaker Localization
MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
Detection and localization of 3d audio-visual objects using unsupervised clustering
ICMI '08 Proceedings of the 10th international conference on Multimodal interfaces
Evaluating multiple object tracking performance: the CLEAR MOT metrics
Journal on Image and Video Processing
Detecting, tracking and interacting with people in a public space
Proceedings of the 2009 international conference on Multimodal interfaces
3D person tracking with a color-based particle filter
RobVis'08 Proceedings of the 2nd international conference on Robot vision
Vision and RFID data fusion for tracking people in crowds by a mobile robot
Computer Vision and Image Understanding
An embedded audio-visual tracking and speech purification system on a dual-core processor platform
Microprocessors & Microsystems
Real-time audio-to-score alignment using particle filter for coplayer music robots
EURASIP Journal on Advances in Signal Processing - Special issue on musical applications of real-time signal processing
Integrating the projective transform with particle filtering for visual tracking
Journal on Image and Video Processing - Special issue on advanced video-based surveillance
Finding audio-visual events in informal social gatherings
ICMI '11 Proceedings of the 13th international conference on multimodal interfaces
Estimating the lecturer's head pose in seminar scenarios – a multi-view approach
MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
Microphone array driven speech recognition: influence of localization on the word error rate
MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
The connector service: predicting availability in mobile contexts
MLMI'06 Proceedings of the Third international conference on Machine Learning for Multimodal Interaction
In this paper, we present a novel approach for tracking a lecturer over the course of a talk. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, which keeps the complexity of the proposed algorithm low. We evaluated the system on data recorded during actual lectures. Our experiments yielded an average error of 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
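The two core ingredients named in the abstract — TDOA estimation with a generalized cross correlation (here the common PHAT weighting) and a particle filter over 3D location hypotheses — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random-walk motion model, and the Gaussian-style score are all placeholders for the paper's multi-feature scoring.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs, max_tau):
    """Estimate the time delay of arrival (TDOA) between two microphone
    signals via PHAT-weighted generalized cross correlation."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    # Re-center the circular correlation around zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def particle_filter_step(particles, weights, score_fn, motion_std=0.05):
    """One predict-score-resample cycle for 3D location hypotheses.

    score_fn(p) would, in the paper's setting, combine image features
    evaluated at p's camera projections with TDOA agreement; here it is
    an arbitrary non-negative scoring callback.
    """
    # Predict: simple random-walk motion model in 3D (an assumption).
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Score each hypothesis.
    weights = np.array([score_fn(p) for p in particles])
    weights = weights / weights.sum()
    # Systematic resampling.
    positions = (np.arange(len(weights)) + np.random.rand()) / len(weights)
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.minimum(idx, len(weights) - 1)
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```

Because features are only evaluated at each particle's projected image position, the per-frame cost grows with the particle count rather than the image size, which is the source of the low complexity claimed above.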