Tracking speakers in multiparty conversations is a fundamental task in automatic meeting analysis. In this paper, we present a probabilistic approach to jointly tracking the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes an explicit proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model: audio observations are derived from a source localization algorithm, and visual observations are based on models of the shape and spatial structure of human heads. Because the complexity of the model rules out exact inference, approximate inference is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework (1) locates and tracks the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
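To make the inference step more concrete, the following is a minimal sketch of one MCMC-PF update on a toy one-dimensional problem. It is not the authors' implementation: the abstract gives no equations, so the Gaussian motion and observation models, the pairwise proximity penalty, the single-person Metropolis-Hastings proposal, and every name and constant (`mcmc_pf_step`, `SIGMA_VIS`, `AUDIO_FLOOR`, and so on) are assumptions chosen only for illustration. Each person's state is a mixed pair (position, speaking flag), and the chain targets the product of the audio-visual likelihood, the interaction term, and a prior formed by averaging the motion model over the previous, unweighted samples.

```python
import math
import random

# All constants are illustrative assumptions, not values from the paper.
SIGMA_DYN = 0.5     # std. dev. of the random-walk motion model
SIGMA_VIS = 0.3     # std. dev. of the visual (head position) observation
SIGMA_AUD = 0.5     # std. dev. of the audio source-location observation
P_FLIP = 0.1        # probability that a speaking flag changes between frames
MIN_DIST = 0.5      # distance below which two people are penalized
AUDIO_FLOOR = 0.01  # flat audio likelihood when nobody is speaking


def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def dynamics(state, prev):
    """p(x_t | x_{t-1}), factored per person: random walk times flag persistence."""
    p = 1.0
    for (pos, spk), (ppos, pspk) in zip(state, prev):
        p *= gauss(pos, ppos, SIGMA_DYN) * (P_FLIP if spk != pspk else 1 - P_FLIP)
    return p


def interaction(state):
    """Proximity-based pairwise penalty for people placed too close together."""
    w = 1.0
    for i in range(len(state)):
        for j in range(i + 1, len(state)):
            gap = abs(state[i][0] - state[j][0])
            if gap < MIN_DIST:
                w *= math.exp(-(MIN_DIST - gap) / MIN_DIST)
    return w


def likelihood(state, z_vis, z_aud):
    """Audio-visual likelihood: one visual cue per person, one audio location."""
    p = 1.0
    for (pos, _), z in zip(state, z_vis):
        p *= gauss(z, pos, SIGMA_VIS)
    speakers = [pos for pos, spk in state if spk]
    if speakers:
        p *= sum(gauss(z_aud, pos, SIGMA_AUD) for pos in speakers) / len(speakers)
    else:
        p *= AUDIO_FLOOR
    return p


def target(state, prev_samples, z_vis, z_aud):
    """Unnormalized posterior: likelihood x interaction x sample-averaged prior."""
    prior = sum(dynamics(state, prev) for prev in prev_samples) / len(prev_samples)
    return likelihood(state, z_vis, z_aud) * interaction(state) * prior


def mcmc_pf_step(prev_samples, z_vis, z_aud, rng, n_samples=100, burn_in=200, thin=5):
    """Run one Metropolis-Hastings chain and return n_samples unweighted samples."""
    state = list(rng.choice(prev_samples))
    score = target(state, prev_samples, z_vis, z_aud)
    out = []
    for it in range(burn_in + n_samples * thin):
        # Symmetric proposal: perturb one randomly chosen person only.
        i = rng.randrange(len(state))
        pos, spk = state[i]
        new_spk = (not spk) if rng.random() < 0.5 else spk
        cand = state[:i] + [(pos + rng.gauss(0.0, 0.2), new_spk)] + state[i + 1:]
        cand_score = target(cand, prev_samples, z_vis, z_aud)
        # Accept with probability min(1, cand_score / score).
        if rng.random() * score < cand_score:
            state, score = cand, cand_score
        if it >= burn_in and (it - burn_in) % thin == 0:
            out.append(tuple(state))
    return out


# Toy frame: two silent people near 0 and 5; the audio source is heard near 5.
rng = random.Random(0)
prev = [((0.0, False), (5.0, False))] * 50
samples = mcmc_pf_step(prev, z_vis=(0.2, 5.1), z_aud=5.0, rng=rng)
mean_pos = [sum(s[i][0] for s in samples) / len(samples) for i in range(2)]
speaking_rate = [sum(s[i][1] for s in samples) / len(samples) for i in range(2)]
```

In this toy frame the second person should be flagged as speaking in most samples, because the audio estimate lies next to that person's position. Changing one person per proposal keeps each move local, which is a common motivation for MCMC moves over joint importance sampling in a multiperson state-space; whether the paper uses this exact proposal is not stated in the abstract.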