We describe the design of a system consisting of several state-of-the-art real-time audio and video processing components that enable multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to let multiple people enter, interact, and leave the observable scene without constraints. They comprise continuous localisation of audio objects and its application to spatial audio object coding; detection and tracking of faces; estimation of head poses and visual focus of attention; detection and localisation of verbal and paralinguistic events; and the association and fusion of these different events. Taken together, they represent multimodal streams as audio objects and semantic video objects and provide semantic information to stream manipulation systems (such as a virtual director). Various experiments have been performed to evaluate the performance of the system. The results demonstrate the effectiveness of the proposed design, the individual algorithms, and the benefit of fusing different modalities in this scenario.
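The association-and-fusion step can be pictured as matching time-stamped audio localisation events against concurrent face tracks so that downstream components (such as a virtual director) receive "who is speaking, where" information. The following is a minimal sketch of that idea in Python; the event formats, thresholds, and function names are assumptions for illustration, not the implementation described in the paper.

# Hypothetical sketch: associate audio localisation events with face
# tracks by temporal proximity and angular agreement. Illustrative only.
from dataclasses import dataclass

@dataclass
class AudioEvent:
    t: float        # timestamp (s)
    azimuth: float  # estimated direction of arrival (degrees)

@dataclass
class FaceTrack:
    t: float        # timestamp (s)
    azimuth: float  # face direction seen from the camera (degrees)
    track_id: int

def associate(audio_events, face_tracks, max_dt=0.2, max_dazimuth=15.0):
    """Greedily pair each audio event with the face track that is
    closest in angle among those close enough in time and direction
    (thresholds are assumed values, not from the paper)."""
    fused = []
    for a in audio_events:
        candidates = [f for f in face_tracks
                      if abs(f.t - a.t) <= max_dt
                      and abs(f.azimuth - a.azimuth) <= max_dazimuth]
        if candidates:
            best = min(candidates, key=lambda f: abs(f.azimuth - a.azimuth))
            fused.append((a, best))  # semantic event: person `track_id` speaks at t
    return fused

if __name__ == "__main__":
    audio = [AudioEvent(t=1.00, azimuth=30.0)]
    faces = [FaceTrack(t=1.05, azimuth=28.0, track_id=7),
             FaceTrack(t=1.02, azimuth=-40.0, track_id=3)]
    for a, f in associate(audio, faces):
        print(f"speaker: track {f.track_id} at t={a.t:.2f}s")

In a real system of this kind, such greedy matching would typically be replaced by probabilistic association over whole event streams, but the sketch shows the basic shape of fusing audio objects with semantic video objects.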