A multi-modal approach for determining speaker location and focus

  • Authors:
  • Michael Siracusa, Louis-Philippe Morency, Kevin Wilson, John Fisher, Trevor Darrell

  • Affiliations:
  • Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA (all authors)

  • Venue:
  • Proceedings of the 5th international conference on Multimodal interfaces
  • Year:
  • 2003

Abstract

This paper presents a multi-modal approach to locating a speaker in a scene and determining to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientations. In addition, estimates of the audio signal's direction of arrival are obtained using a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is demonstrated by our results on video sequences with two or more people in a prototype interactive kiosk environment.
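The abstract describes fusing several independent cues (head pose, audio direction of arrival, audio-video association) in a simple probabilistic framework. The paper's actual model is not given here, so the sketch below is only an illustration of one common fusion scheme: treating the cues as conditionally independent and multiplying their per-speaker likelihoods (summed in log space for numerical stability). All function names and likelihood values are hypothetical.

```python
import numpy as np

def fuse_cues(cue_likelihoods):
    """Combine per-cue likelihoods over candidate speakers under a
    naive conditional-independence assumption (illustrative only,
    not the paper's exact model).

    cue_likelihoods: list of lists, one likelihood vector per cue,
    each of length n_speakers. Returns (best_speaker, posterior).
    """
    # Sum log-likelihoods across cues for each candidate speaker.
    log_post = np.sum(np.log(np.asarray(cue_likelihoods)), axis=0)
    # Normalize into a posterior (subtract max for stability).
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Hypothetical likelihoods for two candidate speakers from three cues:
# visual head pose, audio direction of arrival, audio-video association.
visual = [0.6, 0.4]
doa = [0.7, 0.3]
av_assoc = [0.55, 0.45]

speaker, posterior = fuse_cues([visual, doa, av_assoc])
```

Under this independence assumption, each cue contributes multiplicatively, so a speaker favored by all three cues dominates even when no single cue is decisive, which matches the abstract's claim that integration is more robust than any single cue alone.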