Content analysis of clips that contain people speaking involves processing informative cues from different modalities: typically the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who produced them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We apply the EM algorithm: the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and of the visible persons, and the M-step computes the person models that maximize this expectation. This framework produces excellent results and remains robust when dealing with low-quality data.
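As a rough illustration of the E-step/M-step alternation described above (a generic mixture-model sketch, not the paper's Bayesian network over audiovisual cues), the fragment below fits Gaussian "person models" to one-dimensional cue features. The hidden variable is which person produced each cue; all names and parameters here are hypothetical.

```python
import numpy as np

def em_person_models(x, k=2, n_iter=50):
    """Toy EM over 1-D cue features x, with k hidden 'person' identities.

    E-step: responsibilities r[i, j] = P(person j produced cue i).
    M-step: person-model parameters (weights, means, variances) that
    maximize the expected complete-data log-likelihood.
    """
    n = len(x)
    # Deterministic initialization: spread the means across the data range.
    mu = np.quantile(x, np.linspace(0.0, 1.0, k))
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: Gaussian log-densities plus log-priors, normalized per cue.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates weighted by the responsibilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var, r

# Two well-separated groups of synthetic 'cues' from two hypothetical persons.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
pi, mu, var, r = em_person_models(x, k=2)
```

After convergence, the responsibilities `r` give the soft assignment of each cue to a person, and `mu`, `var`, `pi` are the learned person models.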