We describe the design of a system consisting of several state-of-the-art real-time audio and video processing components that enable multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to let multiple people enter, interact, and leave the observable scene without constraints. They comprise continuous localisation of audio objects and its application to spatial audio object coding; detection and tracking of faces; estimation of head poses and visual focus of attention; detection and localisation of verbal and paralinguistic events; and the association and fusion of these different events. Taken together, they represent multimodal streams as audio objects and semantic video objects and provide semantic information to stream manipulation systems (such as a virtual director). Various experiments have been performed to evaluate the performance of the system. The results demonstrate the effectiveness of the proposed design, the individual algorithms, and the benefit of fusing different modalities in this scenario.
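The association-and-fusion step can be pictured as matching time-stamped audio localisation events against concurrent face tracks so that downstream components (such as a virtual director) receive "who is speaking, where" information. The following is a minimal sketch of that idea in Python; the event formats, thresholds, and function names are assumptions for illustration, not the implementation described in the paper.

# Hypothetical sketch: associate audio localisation events with face
# tracks by temporal proximity and angular agreement. Illustrative only.
from dataclasses import dataclass

@dataclass
class AudioEvent:
    t: float        # timestamp (s)
    azimuth: float  # estimated direction of arrival (degrees)

@dataclass
class FaceTrack:
    t: float        # timestamp (s)
    azimuth: float  # face direction seen from the camera (degrees)
    track_id: int

def associate(audio_events, face_tracks, max_dt=0.2, max_dazimuth=15.0):
    """Greedily pair each audio event with the face track that is
    closest in angle among those close enough in time and direction
    (thresholds are assumed values, not from the paper)."""
    fused = []
    for a in audio_events:
        candidates = [f for f in face_tracks
                      if abs(f.t - a.t) <= max_dt
                      and abs(f.azimuth - a.azimuth) <= max_dazimuth]
        if candidates:
            best = min(candidates, key=lambda f: abs(f.azimuth - a.azimuth))
            fused.append((a, best))  # semantic event: person `track_id` speaks at t
    return fused

if __name__ == "__main__":
    audio = [AudioEvent(t=1.00, azimuth=30.0)]
    faces = [FaceTrack(t=1.05, azimuth=28.0, track_id=7),
             FaceTrack(t=1.02, azimuth=-40.0, track_id=3)]
    for a, f in associate(audio, faces):
        print(f"speaker: track {f.track_id} at t={a.t:.2f}s")

In a real system of this kind, such greedy matching would typically be replaced by probabilistic association over whole event streams, but the sketch shows the basic shape of fusing audio objects with semantic video objects.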