This paper describes the acquisition and content of a new multi-modal database, together with some tools for making use of its data streams. The Computational Audio-Visual Analysis (CAVA) database is a unique collection of three synchronised data streams obtained from a binaural microphone pair, a stereoscopic camera pair and a head-tracking device. All recordings are made from the perspective of a person, i.e., what a human with natural head movements would see and hear in a given environment. The database is intended to facilitate research into humans' ability to optimise their multi-modal sensory input, and it fills a gap by providing data that enables human-centred audio-visual scene analysis. It also enables 3D localisation using audio, visual, or combined audio-visual cues. A total of 50 sessions, with varying degrees of visual and auditory complexity, were recorded. These range from seeing and hearing a single speaker moving in and out of the field of view, to moving around a 'cocktail party' style situation, mingling with and joining different small groups of people chatting.
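As an illustration of the kind of visual 3D localisation the stereoscopic camera pair supports, the following sketch computes depth from disparity for a rectified stereo pair. This is a minimal, standard example, not one of the CAVA tools; the focal length and baseline values are hypothetical placeholders, not the database's actual calibration.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth of a point from a rectified stereo pair: Z = f * b / d,
    where f is the focal length in pixels, b the camera baseline in
    metres, and d the horizontal disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical calibration: f = 700 px, baseline = 0.12 m.
# A feature seen with 42 px disparity lies at 700 * 0.12 / 42 = 2.0 m.
z = depth_from_disparity(42.0, 700.0, 0.12)
print(round(z, 2))  # 2.0
```

Audio-only localisation from the binaural pair would instead rely on interaural time and level differences, and audio-visual localisation on fusing the two cues.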