A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization

  • Authors:
  • Kazuhiro Otsuka; Shoko Araki; Kentaro Ishizuka; Masakiyo Fujimoto; Martin Heinrich; Junji Yamato

  • Affiliations:
  • NTT Communication Science Labs, Atsugi, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Atsugi, Japan; NTT Communication Science Labs, Atsugi, Japan

  • Venue:
  • ICMI '08: Proceedings of the 10th International Conference on Multimodal Interfaces
  • Year:
  • 2008


Abstract

This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., "who is looking at whom", in addition to speaker diarization, i.e., "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from the high-resolution omnidirectional images captured by the cameras, the positions and poses of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by VAD (Voice Activity Detection) and DOA (Direction of Arrival) estimation, followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision processing and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
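The abstract does not spell out how the face position/pose data is turned into "who is looking at whom". The sketch below is a minimal, purely geometric illustration of that step, assuming each tracked face is reduced to a 2-D position on the table plane and a horizontal gaze (yaw) angle, and that attention is assigned to the participant whose direction best matches the gaze within an angular tolerance. The function name, data layout, and threshold are illustrative assumptions, not the authors' method.

```python
# Minimal geometric sketch of VFOA ("who is looking at whom") estimation.
# Assumptions (not from the paper): participants sit around a round table,
# each face pose is summarized by a 2-D position and a horizontal gaze
# (yaw) angle in table coordinates, and a person is said to look at the
# participant whose direction best matches that gaze, within a tolerance.

import math
from typing import Dict, Optional, Tuple

def estimate_vfoa(
    positions: Dict[str, Tuple[float, float]],  # participant id -> (x, y) on the table plane
    yaws: Dict[str, float],                     # participant id -> horizontal gaze angle (radians)
    tolerance: float = math.radians(20.0),      # hypothetical acceptance threshold
) -> Dict[str, Optional[str]]:
    """Return, for each participant, the id they are most likely looking at,
    or None if no other participant falls within the angular tolerance."""
    vfoa: Dict[str, Optional[str]] = {}
    for pid, (x, y) in positions.items():
        gaze = yaws[pid]
        best_target, best_err = None, tolerance
        for qid, (qx, qy) in positions.items():
            if qid == pid:
                continue
            # Direction from pid toward qid in table coordinates.
            direction = math.atan2(qy - y, qx - x)
            # Smallest angular difference between the gaze and that direction.
            err = abs(math.atan2(math.sin(gaze - direction),
                                 math.cos(gaze - direction)))
            if err < best_err:
                best_target, best_err = qid, err
        vfoa[pid] = best_target
    return vfoa

# Example: three people around a table; A faces B, B faces A, C looks away.
if __name__ == "__main__":
    positions = {"A": (1.0, 0.0), "B": (-1.0, 0.0), "C": (0.0, 1.0)}
    yaws = {"A": math.pi, "B": 0.0, "C": math.radians(90.0)}
    print(estimate_vfoa(positions, yaws))
    # -> {'A': 'B', 'B': 'A', 'C': None}
```

In a realtime setting such as the one described, this per-frame assignment would be fed by the tracker's pose estimates and could be smoothed over time; the paper's actual VFOA estimation may differ from this simple nearest-angle rule.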