A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the locations of people, their focus of attention, body pose, sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee, and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees that allows them to incorporate and model temporal signals. We contrast these methods with more traditional approaches in which a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance because designing task-dependent features is time-consuming and not always possible.
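As a rough illustration of the manual feature-engineering baseline contrasted above (not the paper's temporally extended decision trees), the following minimal sketch assumes aligned, fixed-rate multivariate sensor streams stored as NumPy arrays, computes simple trailing-window statistics per channel, and trains an ordinary decision tree; the data layout, window size, channel semantics, and the helper name windowed_features are all hypothetical assumptions introduced for the example.

```python
# Hedged sketch: a hand-engineered temporal-feature baseline for
# frame-level classification (e.g., speaking vs. not speaking).
# Assumptions (not from the paper): streams are aligned, fixed-rate
# arrays of shape (n_frames, n_channels); labels are given per frame.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def windowed_features(stream, window=30):
    """Summarize each channel over a trailing window with simple statistics."""
    n_frames, n_channels = stream.shape
    feats = []
    for t in range(n_frames):
        lo = max(0, t - window + 1)
        seg = stream[lo:t + 1]            # trailing window ending at frame t
        mean = seg.mean(axis=0)           # average level per channel
        std = seg.std(axis=0)             # variability per channel
        slope = seg[-1] - seg[0]          # crude trend over the window
        feats.append(np.concatenate([mean, std, slope]))
    return np.asarray(feats)

# Hypothetical data: 3 channels (e.g., head-pose angle, sound-source
# direction, speech-activity score) over 1000 frames, with binary labels.
rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 3))
labels = rng.integers(0, 2, size=1000)

X = windowed_features(stream, window=30)
clf = DecisionTreeClassifier(max_depth=6).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

In this kind of baseline, the choice of window length and summary statistics encodes the expert's assumptions about which temporal patterns matter; the approach described in the abstract instead aims to let the tree learner construct such temporal representations from the raw streams.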