Learning speaker, addressee and overlap detection models from multimodal streams

  • Authors:
  • Oriol Vinyals, Dan Bohus, Rich Caruana

  • Affiliations:
  • University of California at Berkeley, Berkeley, CA, USA; Microsoft Corporation, Redmond, WA, USA; Microsoft Corporation, Redmond, WA, USA

  • Venue:
  • Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI '12)
  • Year:
  • 2012

Abstract

A key challenge in developing conversational systems is fusing the streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the locations of people, their focus of attention, body pose, sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee, and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees that allows them to incorporate and model temporal signals. We contrast this method with more traditional approaches in which a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance, as designing task-dependent features is time-consuming and not always possible.
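
As a rough illustration of the contrast the abstract draws, the sketch below is not the authors' temporal decision-tree extension; it trains an off-the-shelf decision tree on a synthetic per-frame signal in two ways: on expert-style summary features of each window, and on the raw window values directly. The window length, the feature set, and the synthetic data generator are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
WINDOW = 20  # frames of temporal context per example (assumed)

def make_window(speaking):
    """Synthetic per-frame confidence stream; 'speaking' windows carry a rising trend."""
    base = rng.normal(0.3, 0.1, WINDOW)
    if speaking:
        base = base + np.linspace(0.0, 0.5, WINDOW)  # the temporal pattern to detect
    return base

labels = rng.integers(0, 2, 500)                     # hypothetical speaker/non-speaker labels
X_raw = np.vstack([make_window(y) for y in labels])  # raw window values, one row per example

def engineered(w):
    """Hand-crafted summary features an expert might design for a window."""
    return [w.mean(), w.std(), w[-1] - w[0], w.max()]

X_eng = np.array([engineered(w) for w in X_raw])

# Compare the two input representations with the same learner.
for name, X in [("raw window values", X_raw), ("engineered features", X_eng)]:
    acc = cross_val_score(DecisionTreeClassifier(max_depth=5), X, labels, cv=5)
    print(f"{name}: mean CV accuracy {acc.mean():.2f}")
```

This sketch only varies the input representation fed to a standard tree; the paper's contribution, by contrast, modifies the tree-learning procedure itself so that split tests can model temporal signals directly, reducing the need for the hand-engineered features shown in `engineered`.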