Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings

  • Authors:
  • D. Gatica-Perez; G. Lathoud; J.-M. Odobez; I. McCowan

  • Affiliations:
  • IDIAP Research Institute, Ecole Polytechnique Federale de Lausanne, Martigny

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2007

Abstract

Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a novel probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state space, which includes an explicit proximity-based interaction model. The model integrates audiovisual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework 1) locates and tracks the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy, 2) can deal with visual clutter and occlusion, and 3) significantly outperforms a traditional sampling-based approach.
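
The paper itself gives no code, but the core inference idea named in the abstract can be sketched. Below is a minimal Python illustration of one time step of an MCMC particle filter, assuming a generic flat state vector and two hypothetical callables, `log_likelihood` and `transition_sample`, standing in for the paper's AV observation model and dynamics; the joint multiperson state space and the proximity-based interaction prior are not reproduced here. This is a sketch of the general MCMC-PF technique, not the authors' implementation.

```python
import numpy as np

def mcmc_pf_step(particles, observation, log_likelihood,
                 transition_sample, n_burn_in=100, rng=None):
    """One time step of an MCMC particle filter (independence sampler).

    particles:         (N, D) array of states from the previous time step
    observation:       the current (audiovisual) measurement
    log_likelihood:    callable, log p(observation | state)  [placeholder]
    transition_sample: callable, draws x_t ~ p(x_t | x_{t-1})  [placeholder]
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)

    def propose():
        # Proposal = dynamic prior: pick a random parent particle and
        # push it through the motion model.
        parent = particles[rng.integers(n)]
        return transition_sample(parent, rng)

    # Initialize the Markov chain with a draw from the prior.
    current = propose()
    current_ll = log_likelihood(current, observation)

    new_particles = np.empty_like(particles)
    kept = 0
    for step in range(n_burn_in + n):
        candidate = propose()
        candidate_ll = log_likelihood(candidate, observation)

        # With the prior as the proposal, the Metropolis-Hastings
        # acceptance ratio reduces to the likelihood ratio.
        if np.log(rng.random()) < candidate_ll - current_ll:
            current, current_ll = candidate, candidate_ll

        # After burn-in, each visited chain state becomes a particle
        # representing the filtering distribution at time t.
        if step >= n_burn_in:
            new_particles[kept] = current
            kept += 1
    return new_particles
```

Using the dynamic prior as the proposal makes the Metropolis-Hastings acceptance test collapse to a likelihood ratio, which is one reason MCMC-based particle filters can achieve higher sampling efficiency than plain importance-sampling filters in high-dimensional multiperson state spaces, as the abstract's comparison with a traditional sampling-based approach suggests.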