A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization

  • Authors:
  • Kazuhiro Otsuka; Shoko Araki; Kentaro Ishizuka; Masakiyo Fujimoto; Martin Heinrich; Junji Yamato

  • Affiliations:
  • NTT Communication Science Labs, Atsugi, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Kyoto, Japan; NTT Communication Science Labs, Atsugi, Japan; NTT Communication Science Labs, Atsugi, Japan

  • Venue:
  • ICMI '08: Proceedings of the 10th International Conference on Multimodal Interfaces
  • Year:
  • 2008


Abstract

This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., "who is looking at whom", in addition to speaker diarization, i.e., "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from the high-resolution omnidirectional images captured by the cameras, the positions and poses of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by VAD (Voice Activity Detection) and DOA (Direction of Arrival) estimation, followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision processing and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
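The abstract does not spell out how the face position/pose data is turned into "who is looking at whom". The sketch below is a minimal, purely geometric illustration of that step, assuming each tracked face is reduced to a 2-D position on the table plane and a horizontal gaze (yaw) angle, and that attention is assigned to the participant whose direction best matches the gaze within an angular tolerance. The function name, data layout, and threshold are illustrative assumptions, not the authors' method.

```python
# Minimal geometric sketch of VFOA ("who is looking at whom") estimation.
# Assumptions (not from the paper): participants sit around a round table,
# each face pose is summarized by a 2-D position and a horizontal gaze
# (yaw) angle in table coordinates, and a person is said to look at the
# participant whose direction best matches that gaze, within a tolerance.

import math
from typing import Dict, Optional, Tuple

def estimate_vfoa(
    positions: Dict[str, Tuple[float, float]],  # participant id -> (x, y) on the table plane
    yaws: Dict[str, float],                     # participant id -> horizontal gaze angle (radians)
    tolerance: float = math.radians(20.0),      # hypothetical acceptance threshold
) -> Dict[str, Optional[str]]:
    """Return, for each participant, the id they are most likely looking at,
    or None if no other participant falls within the angular tolerance."""
    vfoa: Dict[str, Optional[str]] = {}
    for pid, (x, y) in positions.items():
        gaze = yaws[pid]
        best_target, best_err = None, tolerance
        for qid, (qx, qy) in positions.items():
            if qid == pid:
                continue
            # Direction from pid toward qid in table coordinates.
            direction = math.atan2(qy - y, qx - x)
            # Smallest angular difference between the gaze and that direction.
            err = abs(math.atan2(math.sin(gaze - direction),
                                 math.cos(gaze - direction)))
            if err < best_err:
                best_target, best_err = qid, err
        vfoa[pid] = best_target
    return vfoa

# Example: three people around a table; A faces B, B faces A, C looks away.
if __name__ == "__main__":
    positions = {"A": (1.0, 0.0), "B": (-1.0, 0.0), "C": (0.0, 1.0)}
    yaws = {"A": math.pi, "B": 0.0, "C": math.radians(90.0)}
    print(estimate_vfoa(positions, yaws))
    # -> {'A': 'B', 'B': 'A', 'C': None}
```

In a realtime setting such as the one described, this per-frame assignment would be fed by the tracker's pose estimates and could be smoothed over time; the paper's actual VFOA estimation may differ from this simple nearest-angle rule.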