Content analysis of clips that contain people speaking involves processing informative cues from different modalities: typically the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who produced them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We apply the EM algorithm: the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and of the visible persons, and the M-step computes the person models that maximize this expectation. This framework produces excellent results and remains robust when dealing with low-quality data.
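As a rough illustration of the E-step/M-step alternation described above (a generic mixture-model sketch, not the paper's Bayesian network over audiovisual cues), the fragment below fits Gaussian "person models" to one-dimensional cue features. The hidden variable is which person produced each cue; all names and parameters here are hypothetical.

```python
import numpy as np

def em_person_models(x, k=2, n_iter=50):
    """Toy EM over 1-D cue features x, with k hidden 'person' identities.

    E-step: responsibilities r[i, j] = P(person j produced cue i).
    M-step: person-model parameters (weights, means, variances) that
    maximize the expected complete-data log-likelihood.
    """
    n = len(x)
    # Deterministic initialization: spread the means across the data range.
    mu = np.quantile(x, np.linspace(0.0, 1.0, k))
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: Gaussian log-densities plus log-priors, normalized per cue.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates weighted by the responsibilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var, r

# Two well-separated groups of synthetic 'cues' from two hypothetical persons.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
pi, mu, var, r = em_person_models(x, k=2)
```

After convergence, the responsibilities `r` give the soft assignment of each cue to a person, and `mu`, `var`, `pi` are the learned person models.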