Multimodal Speaker Detection Using Input/Output Dynamic Bayesian Networks

  • Authors:
  • Vladimir Pavlovic, Ashutosh Garg, James M. Rehg

  • Venue:
  • ICMI '00: Proceedings of the Third International Conference on Advances in Multimodal Interfaces
  • Year:
  • 2000

Abstract

Inferring users' actions and intentions forms an integral part of design and development of any human-computer interface. The presence of noisy and at times ambiguous sensory data makes this problem challenging. We formulate a framework for temporal fusion of multiple sensors using input-output dynamic Bayesian networks (IODBNs). We find that contextual information about the state of the computer interface, used as an input to the DBN, and sensor distributions learned from data are crucial for good detection performance. Nevertheless, classical DBN learning methods can cause such models to fail when the data exhibits complex behavior. To further improve the detection rate we formulate an error-feedback learning strategy for DBNs. We apply this framework to the problem of audio/visual speaker detection in an interactive kiosk application using "off-the-shelf" visual and audio sensors (face, skin, texture, mouth motion, and silence detectors). Detection results obtained in this setup demonstrate numerous benefits of our learning-based framework.
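The temporal-fusion idea in the abstract — a hidden speaker state whose dynamics depend on a contextual input, updated each frame by fusing several noisy sensor likelihoods — can be illustrated with a forward filter over a two-state input-output HMM. This is only a minimal sketch: the state space, context values, transition probabilities, and sensor likelihoods below are invented for illustration and are not the model or parameters from the paper.

```python
import numpy as np

# States: 0 = no speaker present, 1 = speaker present.
# Context input (the "input" of the input-output model):
#   0 = kiosk idle, 1 = kiosk prompting the user (hypothetical values).

# TRANS[u][i][j] = P(x_t = j | x_{t-1} = i, context u); each row sums to 1.
TRANS = np.array([
    [[0.9, 0.1],    # idle: a speaker rarely appears
     [0.3, 0.7]],
    [[0.6, 0.4],    # prompting: a speaker is more likely to appear and stay
     [0.1, 0.9]],
])

def fuse_sensors(sensor_liks):
    """Naive-Bayes fusion: multiply per-sensor likelihoods P(z_k | x)
    over sensors k, giving one joint likelihood vector per frame."""
    return np.prod(np.asarray(sensor_liks), axis=0)

def forward_filter(frame_liks, contexts, trans, prior):
    """Exact forward filtering P(x_t | z_1..t, u_1..t) for a small IO-HMM."""
    belief = np.asarray(prior, dtype=float)
    beliefs = []
    for lik, u in zip(frame_liks, contexts):
        belief = trans[u].T @ belief   # predict under context-dependent dynamics
        belief = belief * lik          # weight by fused sensor evidence
        belief /= belief.sum()         # normalise to a distribution
        beliefs.append(belief.copy())
    return np.array(beliefs)

# Hypothetical per-frame likelihoods for three sensors
# (e.g. face, mouth-motion, audio), each giving P(z | x) for x in {0, 1}.
frames = [
    fuse_sensors([[0.8, 0.3], [0.7, 0.4], [0.9, 0.2]]),   # quiet, empty scene
    fuse_sensors([[0.2, 0.8], [0.3, 0.9], [0.4, 0.85]]),  # face and speech cues
]
beliefs = forward_filter(frames, contexts=[0, 1], trans=TRANS,
                         prior=[0.5, 0.5])
```

With these made-up numbers, the belief favours "no speaker" after the first frame and flips to "speaker" once face and speech cues arrive under the prompting context, which is the qualitative behaviour the fusion framework is after.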