A multimodal speaker detection and tracking system for teleconferencing

Authors:
Billibon H. Yoshimi; Gopal S. Pingali
Affiliations:
IBM T. J. Watson Research Lab, Yorktown Heights, NY; IBM T. J. Watson Research Lab, Hawthorne, NY
Venue:
Proceedings of the tenth ACM international conference on Multimedia
Year:
2002

Abstract

A serious problem with the audio and video conferencing facilities available today is the difficulty of determining who is speaking among a large number of participants. There is a strong need to develop meeting room infrastructure and teleconferencing facilities that improve the sense of presence and participation experienced in remote meetings. We present a distributed multimodal tracking system that uses multiple cameras and microphones to automatically select the current speaker among multiple meeting participants. The system actively obtains and transmits video showing a good view of the selected speaker. The tracking system is integrated into a web-based video conferencing application that connects seven meeting rooms around the globe. An important part of designing such a system is determining sensor placement and configuration through systematic experiments in the actual rooms where the system is deployed.