Speaker association with signal-level audiovisual fusion

  • Authors:
  • J. W. Fisher, III;T. Darrell

  • Affiliations:
  • Comput. Sci. & Artificial Intelligence Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA;-

  • Venue:
  • IEEE Transactions on Multimedia
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.