Audio-Visual Clustering for 3D Speaker Localization

  • Authors:
  • Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu Horaud

  • Affiliations:
  • INRIA Grenoble Rhône-Alpes, France 38334 (all authors); Elise Arnaud is also with Université Joseph Fourier, Grenoble Cedex 9, France 38041

  • Venue:
  • MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2008

Abstract

We address the problem of localizing individuals in a scene that contains several people engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene space via a pair of Gaussian mixture models. Inference is performed by a variant of the expectation-maximization (EM) algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
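To illustrate the clustering step described above, here is a minimal sketch of EM inference for an isotropic Gaussian mixture in a 3D scene space, where each cluster center plays the role of a hypothesized speaker position. This is a simplified illustration, not the authors' full audio-visual model: it assumes the observations have already been mapped into a common 3D space, and the function name `em_gmm` and the toy data are ours.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for an isotropic Gaussian mixture in 3D scene space.
    Each mixture component stands in for one hypothesized speaker."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]   # initial speaker positions
    var = np.full(K, X.var())                 # per-component isotropic variance
    pi = np.full(K, 1.0 / K)                  # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each speaker for each observation
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)            # (N, K)
        logp = -0.5 * d2 / var - 0.5 * D * np.log(2 * np.pi * var) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)                   # stabilize
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate positions, variances, and weights
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
        pi = Nk / N
    return mu, var, pi, r

# Toy data: two "speakers" at known 3D positions, observed with noise
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0, 1.0], 0.1, (100, 3)),
               rng.normal([1.0, 0.0, 2.0], 0.1, (100, 3))])
mu, var, pi, r = em_gmm(X, K=2)
```

In the paper's full model, the audio and visual observations live in different observation spaces and are tied to the common 3D space through a pair of such mixtures; the sketch above shows only the shared EM machinery on a single mixture.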