Finding audio-visual events in informal social gatherings

  • Authors:
  • Xavier Alameda-Pineda; Vasil Khalidov; Radu Horaud; Florence Forbes

  • Affiliations:
  • INRIA, Grenoble, France; IDIAP, Martigny, Switzerland; INRIA, Grenoble, France; INRIA, Grenoble, France

  • Venue:
  • ICMI '11: Proceedings of the 13th International Conference on Multimodal Interfaces
  • Year:
  • 2011

Abstract

In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This problem may be cast as one of data clustering. We propose a new multimodal clustering algorithm based on a Gaussian mixture model, in which one of the modalities (visual data) is used to supervise the clustering process. This is made possible by mapping both modalities into the same metric space. To this end, we fully exploit the geometric and physical properties of an audio-visual sensor based on binocular vision and binaural hearing. We propose an EM algorithm that is theoretically well justified, intuitive, and computationally very efficient. This efficiency makes the method implementable on advanced platforms such as humanoid robots. We describe in detail tests and experiments performed with publicly available data sets, which yield promising results.
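The core idea, clustering observations from both modalities once they are mapped into one shared metric space, can be illustrated with a generic EM algorithm for an isotropic Gaussian mixture. This is only a minimal sketch under simplifying assumptions (synthetic data, spherical covariances, farthest-point initialization), not the paper's supervised audio-visual variant:

```python
import numpy as np

def em_gmm(X, K, n_iter=100):
    """Fit a K-component isotropic Gaussian mixture to X (N x D) via EM."""
    N, D = X.shape
    # Farthest-point initialization of the means (deterministic).
    idx = [0]
    for _ in range(1, K):
        d = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    mu = X[idx].copy()
    var = np.full(K, X.var())          # one spherical variance per component
    w = np.full(K, 1.0 / K)            # mixture weights
    for _ in range(n_iter):
        # E-step: responsibilities under isotropic Gaussians.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K
        log_p = -0.5 * d2 / var - 0.5 * D * np.log(2 * np.pi * var) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances.
        Nk = r.sum(axis=0)
        w = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return mu, r.argmax(axis=1)

# Two hypothetical "speakers": visual and auditory observations assumed
# already mapped into one 3-D metric space, then pooled and clustered.
rng = np.random.default_rng(1)
speaker_a = rng.normal([0.0, 0.0, 2.0], 0.1, size=(40, 3))
speaker_b = rng.normal([1.0, 0.0, 3.0], 0.1, size=(40, 3))
X = np.vstack([speaker_a, speaker_b])
means, labels = em_gmm(X, K=2)
```

Because both modalities live in the same space, a single mixture component can gather the visual and auditory observations produced by one person; the recovered component means then serve as 3-D localizations of the audio-visual events.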