This work presents the acoustic- and visual-based tracking system operating in the Harvard Intelligent Multi-Media Environments Laboratory (HIMMEL), an environment populated with a number of microphones and steerable video cameras. Acoustic source localization, video-based face tracking and pose estimation, and multi-channel speech enhancement methods are applied in combination to detect and track individuals in a practical environment while also providing an improved audio signal to accompany the video stream. The video portion of the system tracks talkers using source motion, contour geometry, color data, and simple facial features. Camera selection is driven by an estimate of the head's gaze angle; this head pose estimate is obtained with a very general head model that employs hairline features and a learned network classification procedure. Finally, a beamforming and postfiltering microphone-array technique produces an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g., face-like ovals in the scene of interest that are not faces) and acoustic noise (e.g., reverberation and background noise).
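To make the microphone-array stage concrete, the sketch below illustrates the general idea behind such systems: estimate the inter-microphone time delay of arrival and then align and average the channels (delay-and-sum beamforming). This is a minimal illustration, not the paper's actual algorithm; the GCC-PHAT delay estimator, the function names, and the integer-sample alignment are assumptions, and the postfiltering stage mentioned in the abstract is omitted.

```python
import numpy as np

def estimate_delay(ref, sig, fs, max_delay_s=0.001):
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(ref) + len(sig)  # zero-pad so the correlation is linear
    R = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    R /= np.abs(R) + 1e-12   # PHAT weighting: keep only phase information
    cc = np.fft.irfft(R, n)
    max_shift = int(max_delay_s * fs)
    # Re-center so indices run over lags -max_shift .. +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(cc)) - max_shift

def delay_and_sum(channels, delays):
    """Advance each channel by its estimated delay and average
    (delay-and-sum beamforming with integer-sample alignment)."""
    out = np.zeros_like(channels[0], dtype=float)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)  # circular shift; fine away from edges
    return out / len(channels)
```

A typical use would estimate each channel's delay against a reference microphone and pass those delays to `delay_and_sum`; a real system would additionally use fractional-delay interpolation and a postfilter on the beamformer output.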