A multi-modal approach for determining speaker location and focus

  • Authors:
  • Michael Siracusa, Louis-Philippe Morency, Kevin Wilson, John Fisher, Trevor Darrell

  • Affiliations:
  • Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA (all authors)

  • Venue:
  • Proceedings of the 5th international conference on Multimodal interfaces
  • Year:
  • 2003

Abstract

This paper presents a multi-modal approach to locating a speaker in a scene and determining to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientations. In addition, estimates of the audio signal's direction of arrival are obtained using a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is demonstrated by our results on video sequences with two or more people in a prototype interactive kiosk environment.
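The abstract describes fusing several independent cues (head pose, audio direction of arrival, audio-video association) in a simple probabilistic framework. The paper's actual model is not given here, so the sketch below is only an illustration of one common fusion scheme: treating the cues as conditionally independent and multiplying their per-speaker likelihoods (summed in log space for numerical stability). All function names and likelihood values are hypothetical.

```python
import numpy as np

def fuse_cues(cue_likelihoods):
    """Combine per-cue likelihoods over candidate speakers under a
    naive conditional-independence assumption (illustrative only,
    not the paper's exact model).

    cue_likelihoods: list of lists, one likelihood vector per cue,
    each of length n_speakers. Returns (best_speaker, posterior).
    """
    # Sum log-likelihoods across cues for each candidate speaker.
    log_post = np.sum(np.log(np.asarray(cue_likelihoods)), axis=0)
    # Normalize into a posterior (subtract max for stability).
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Hypothetical likelihoods for two candidate speakers from three cues:
# visual head pose, audio direction of arrival, audio-video association.
visual = [0.6, 0.4]
doa = [0.7, 0.3]
av_assoc = [0.55, 0.45]

speaker, posterior = fuse_cues([visual, doa, av_assoc])
```

Under this independence assumption, each cue contributes multiplicatively, so a speaker favored by all three cues dominates even when no single cue is decisive, which matches the abstract's claim that integration is more robust than any single cue alone.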