This paper addresses the problem of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations, and show that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between the audio and visual observations, mapping the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a formally derived variant of the expectation-maximization (EM) algorithm, which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization in the presence of other audio sources.
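To make the clustering-by-EM idea concrete, here is a minimal illustrative sketch, not the paper's actual model: it fits an isotropic Gaussian mixture to observations that are assumed to have already been mapped into a common 3D representation. The synthetic data, the component count `K`, the isotropic covariances, and the deterministic initialization are all assumptions made for the example.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Fit a K-component isotropic Gaussian mixture to X (N x D) via EM."""
    N, D = X.shape
    # Deterministic init: means from evenly spaced data points,
    # unit variances, uniform mixing weights (an assumption for this sketch).
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    var = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, var_k * I),
        # computed in log space for numerical stability.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # (N, K)
        log_p = np.log(pi) - 0.5 * (D * np.log(2 * np.pi * var) + d2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var

# Synthetic "audio-visual" observations, assumed already mapped to 3D:
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 2.0], [1.5, 0.5, 3.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((200, 3)) for c in centers])
pi, mu, var = em_gmm(X, K=2)
```

Each recovered component mean plays the role of an object's estimated 3D position, and the responsibilities assign each observation to an object; the actual model additionally ties audio and visual observation spaces together through a pair of linked mixtures.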