Audio-Visual Clustering for 3D Speaker Localization

  • Authors:
  • Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu Horaud

  • Affiliations:
  • INRIA Grenoble Rhône-Alpes, France 38334 (all authors); Elise Arnaud is also with Université Joseph Fourier, Grenoble Cedex 9, France 38041

  • Venue:
  • MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2008

Abstract

We address the problem of localizing individuals in a scene that contains several people engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene space via a pair of Gaussian mixture models. Inference is performed by a variant of the expectation-maximization (EM) algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
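To illustrate the clustering step described above, here is a minimal sketch of EM inference for an isotropic Gaussian mixture in a 3D scene space, where each cluster center plays the role of a hypothesized speaker position. This is a simplified illustration, not the authors' full audio-visual model: it assumes the observations have already been mapped into a common 3D space, and the function name `em_gmm` and the toy data are ours.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for an isotropic Gaussian mixture in 3D scene space.
    Each mixture component stands in for one hypothesized speaker."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]   # initial speaker positions
    var = np.full(K, X.var())                 # per-component isotropic variance
    pi = np.full(K, 1.0 / K)                  # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each speaker for each observation
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)            # (N, K)
        logp = -0.5 * d2 / var - 0.5 * D * np.log(2 * np.pi * var) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)                   # stabilize
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate positions, variances, and weights
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
        pi = Nk / N
    return mu, var, pi, r

# Toy data: two "speakers" at known 3D positions, observed with noise
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0, 1.0], 0.1, (100, 3)),
               rng.normal([1.0, 0.0, 2.0], 0.1, (100, 3))])
mu, var, pi, r = em_gmm(X, K=2)
```

In the paper's full model, the audio and visual observations live in different observation spaces and are tied to the common 3D space through a pair of such mixtures; the sketch above shows only the shared EM machinery on a single mixture.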