This paper addresses the problem of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations, and show that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between the audio and visual observations, mapping the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a formally derived variant of the expectation-maximization (EM) algorithm, which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization in the presence of other audio sources.
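To make the clustering-by-EM idea concrete, here is a minimal illustrative sketch, not the paper's actual model: it fits an isotropic Gaussian mixture to observations that are assumed to have already been mapped into a common 3D representation. The synthetic data, the component count `K`, the isotropic covariances, and the deterministic initialization are all assumptions made for the example.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Fit a K-component isotropic Gaussian mixture to X (N x D) via EM."""
    N, D = X.shape
    # Deterministic init: means from evenly spaced data points,
    # unit variances, uniform mixing weights (an assumption for this sketch).
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    var = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, var_k * I),
        # computed in log space for numerical stability.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # (N, K)
        log_p = np.log(pi) - 0.5 * (D * np.log(2 * np.pi * var) + d2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var

# Synthetic "audio-visual" observations, assumed already mapped to 3D:
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 2.0], [1.5, 0.5, 3.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((200, 3)) for c in centers])
pi, mu, var = em_gmm(X, K=2)
```

Each recovered component mean plays the role of an object's estimated 3D position, and the responsibilities assign each observation to an object; the actual model additionally ties audio and visual observation spaces together through a pair of linked mixtures.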