AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking

  • Authors:
  • Guillaume Lathoud, Jean-Marc Odobez, Daniel Gatica-Perez

  • Affiliations:
  • IDIAP Research Institute, Martigny, Switzerland (all authors)

  • Venue:
  • MLMI'04: Proceedings of the First International Workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2004

Abstract

Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner in a meeting room. Part of this corpus has already been successfully used to report research results.
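
The abstract does not detail the 3-D annotation method beyond "based on calibrated cameras"; such annotation typically comes down to triangulating a point observed in several calibrated views. The following is a minimal sketch of linear (DLT) triangulation under that assumption; the function name and toy camera geometry are illustrative, not the authors' implementation:

    import numpy as np

    def triangulate_point(projections, pixels):
        """Linear (DLT) triangulation of one 3-D point from N calibrated views.

        projections: list of 3x4 projection matrices P_i = K_i [R_i | t_i]
        pixels:      list of (u, v) observations of the point, one per camera
        """
        rows = []
        for P, (u, v) in zip(projections, pixels):
            # For homogeneous X: u = (P[0] @ X) / (P[2] @ X) and
            # v = (P[1] @ X) / (P[2] @ X), i.e. two linear constraints per view.
            rows.append(u * P[2] - P[0])
            rows.append(v * P[2] - P[1])
        A = np.asarray(rows)
        # The right singular vector of the smallest singular value is the
        # homogeneous solution minimizing the algebraic error ||A X||.
        X = np.linalg.svd(A)[2][-1]
        return X[:3] / X[3]  # dehomogenize to 3-D coordinates

    # Toy check with two synthetic cameras (hypothetical geometry, not AV16.3's):
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])        # reference camera
    P2 = np.hstack([np.eye(3), [[-1.0], [0.0], [0.0]]])  # 1 m baseline along x
    X_true = np.array([0.3, -0.2, 2.0, 1.0])
    obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
    print(triangulate_point([P1, P2], obs))              # ~ [0.3 -0.2 2.0]

With more than two cameras the same least-squares formulation simply gains extra rows, which is what makes a multi-camera setup like the three-camera room described above attractive for producing reliable 3-D ground truth.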