Conjugate mixture models for clustering multimodal data

Authors:
Vasil Khalidov;Florence Forbes;Radu Horaud
Venue:
Neural Computation
Year:
2011

Abstract

The problem of multimodal clustering arises whenever the data are gathered with several physically different sensors. Observations from different modalities are not necessarily aligned, in the sense that there is no obvious way to associate or compare them in some common space. One solution is to treat the clustering task independently for each modality. The main difficulty with such an approach is guaranteeing that the unimodal clusterings are mutually consistent. In this letter, we show that multimodal clustering can be addressed within a novel framework: conjugate mixture models. These models exploit the explicit transformations that are often available between an unobserved parameter space (objects) and each of the observation spaces (sensors). We formulate the problem as a likelihood maximization task and derive the associated conjugate expectation-maximization algorithm. The convergence properties of the proposed algorithm are thoroughly investigated. We propose several local and global optimization techniques to increase its convergence speed, compare two initialization strategies, and propose a consistent model selection criterion. The algorithm and its variants are tested and evaluated within the task of 3D localization of several speakers using both auditory and visual data.
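
The idea of tying two mixtures together through a shared parameter space can be illustrated with a minimal sketch. It is not the algorithm from the letter: it assumes linear sensor maps (the hypothetical matrices `A` and `B`), fixed isotropic noise levels, and no outlier class, so that the update of each shared object parameter reduces to a weighted least-squares solve. With nonlinear maps that step would require numerical optimization instead. All names (`conjugate_em`, `responsibilities`, `S0`) are illustrative.

```python
import numpy as np

def responsibilities(X, means, priors, sigma):
    """Posterior probability that each row of X belongs to each component."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_p = np.log(priors + 1e-12)[None, :] - 0.5 * d2 / sigma**2
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def conjugate_em(F, G, A, B, S0, sigma_f=1.0, sigma_g=1.0, n_iter=50):
    """EM for two mixtures whose means are tied through shared parameters.

    Assumption: component n has mean A @ s_n in the first observation
    space and B @ s_n in the second, with known linear maps A and B.
    F: (M, df), G: (K, dg), A: (df, d), B: (dg, d), S0: (N, d).
    """
    S = np.array(S0, dtype=float)
    N = S.shape[0]
    pi_f = np.full(N, 1.0 / N)
    pi_g = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        # E-step: one set of responsibilities per modality.
        Rf = responsibilities(F, S @ A.T, pi_f, sigma_f)
        Rg = responsibilities(G, S @ B.T, pi_g, sigma_g)
        # M-step: mixing proportions of each modality.
        pi_f = Rf.mean(axis=0)
        pi_g = Rg.mean(axis=0)
        # M-step: each shared s_n is fitted to both modalities at once.
        for n in range(N):
            wf = Rf[:, n].sum() / sigma_f**2
            wg = Rg[:, n].sum() / sigma_g**2
            H = wf * (A.T @ A) + wg * (B.T @ B)
            b = (A.T @ (Rf[:, n] @ F)) / sigma_f**2 \
                + (B.T @ (Rg[:, n] @ G)) / sigma_g**2
            S[n] = np.linalg.solve(H, b)
    return S, pi_f, pi_g

# Synthetic demo: two objects in a 2D parameter space, each sensor
# observing a different 1D linear projection of it.
rng = np.random.default_rng(0)
S_true = np.array([[-3.0, -3.0], [3.0, 3.0]])
A = np.array([[1.0, 0.5]])
B = np.array([[0.0, 1.0]])
labels_f = rng.integers(0, 2, size=200)
labels_g = rng.integers(0, 2, size=150)
F = S_true[labels_f] @ A.T + 0.5 * rng.standard_normal((200, 1))
G = S_true[labels_g] @ B.T + 0.5 * rng.standard_normal((150, 1))
S_hat, pi_f, pi_g = conjugate_em(
    F, G, A, B, S0=[[-1.0, -1.0], [1.0, 1.0]], sigma_f=0.5, sigma_g=0.5
)
print(S_hat)
```

In this toy setup neither sensor alone determines the 2D object positions, since each sees only a 1D projection. The joint update of `S` is what keeps the two unimodal clusterings consistent, because component `n` refers to the same object in both observation spaces.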