Learning speaker, addressee and overlap detection models from multimodal streams

  • Authors:
  • Oriol Vinyals, Dan Bohus, Rich Caruana

  • Affiliations:
  • University of California at Berkeley, Berkeley, CA, USA; Microsoft Corporation, Redmond, WA, USA; Microsoft Corporation, Redmond, WA, USA

  • Venue:
  • Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI '12)
  • Year:
  • 2012

Abstract

A key challenge in developing conversational systems is fusing the streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the locations of people, their focus of attention, body pose, sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee, and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees that allows them to incorporate and model temporal signals. We contrast this method with more traditional approaches in which a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance, as designing task-dependent features is time-consuming and not always possible.
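
As a rough illustration of the contrast the abstract draws, the sketch below is not the authors' temporal decision-tree extension; it trains an off-the-shelf decision tree on a synthetic per-frame signal in two ways: on expert-style summary features of each window, and on the raw window values directly. The window length, the feature set, and the synthetic data generator are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
WINDOW = 20  # frames of temporal context per example (assumed)

def make_window(speaking):
    """Synthetic per-frame confidence stream; 'speaking' windows carry a rising trend."""
    base = rng.normal(0.3, 0.1, WINDOW)
    if speaking:
        base = base + np.linspace(0.0, 0.5, WINDOW)  # the temporal pattern to detect
    return base

labels = rng.integers(0, 2, 500)                     # hypothetical speaker/non-speaker labels
X_raw = np.vstack([make_window(y) for y in labels])  # raw window values, one row per example

def engineered(w):
    """Hand-crafted summary features an expert might design for a window."""
    return [w.mean(), w.std(), w[-1] - w[0], w.max()]

X_eng = np.array([engineered(w) for w in X_raw])

# Compare the two input representations with the same learner.
for name, X in [("raw window values", X_raw), ("engineered features", X_eng)]:
    acc = cross_val_score(DecisionTreeClassifier(max_depth=5), X, labels, cv=5)
    print(f"{name}: mean CV accuracy {acc.mean():.2f}")
```

This sketch only varies the input representation fed to a standard tree; the paper's contribution, by contrast, modifies the tree-learning procedure itself so that split tests can model temporal signals directly, reducing the need for the hand-engineered features shown in `engineered`.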