Audio-visual speaker diarisation is the task of estimating ``who spoke when'' using audio and visual cues. In this paper we propose combining an audio diarisation system with psychology-inspired visual features, reporting experiments on multiparty meetings, a challenging domain characterised by unconstrained interaction and participant movement. More precisely, the role of gaze in coordinating speaker turns is exploited through Visual Focus of Attention (VFoA) features. Experiments were performed both with reference VFoA annotations and with three automatic VFoA estimation systems of increasing complexity, based on head pose and visual activity cues. Combined with audio features in a multi-stream approach, the VFoA features yielded consistent speaker diarisation improvements.
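The multi-stream combination described above can be sketched as a weighted sum of per-stream log-likelihoods, a common fusion scheme in audio-visual diarisation. The sketch below is illustrative only: the function name, the fixed stream weight, and the per-frame argmax decision are assumptions, not the paper's exact system (which operates within a full diarisation pipeline with clustering and temporal smoothing).

```python
import numpy as np

def multistream_fuse(audio_ll, vfoa_ll, alpha=0.8):
    """Fuse per-frame, per-speaker log-likelihoods from two streams.

    audio_ll, vfoa_ll: arrays of shape (n_frames, n_speakers) holding
    log-likelihoods from the audio and VFoA streams respectively.
    alpha: audio stream weight in [0, 1]; (1 - alpha) weights VFoA.
    Returns the per-frame speaker decision (argmax over speakers).
    """
    fused = alpha * audio_ll + (1.0 - alpha) * vfoa_ll
    return fused.argmax(axis=1)

# Toy example: 2 frames, 2 speakers. The audio stream favours
# speaker 0 then speaker 1; the VFoA stream disagrees on frame 0.
audio = np.array([[-1.0, -3.0],
                  [-3.0, -1.0]])
vfoa = np.array([[-2.0, -1.0],
                 [-1.0, -2.0]])

print(multistream_fuse(audio, vfoa, alpha=0.8))  # audio dominates
print(multistream_fuse(audio, vfoa, alpha=0.0))  # VFoA only
```

In practice the stream weight would be tuned on development data; a lower audio weight lets the visual stream override ambiguous acoustic evidence, e.g. in overlapped speech.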