Exploring Co-Occurrence Between Speech and Body Movement for Audio-Guided Video Localization

  • Authors:
  • H. Vajaria; S. Sarkar; R. Kasturi

  • Affiliations:
  • University of South Florida, Tampa, FL

  • Venue:
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Year:
  • 2008

Abstract

This paper presents a bottom-up approach that combines audio and video to simultaneously locate individual speakers in the video (2D source localization) and segment their speech (speaker diarization), in meetings recorded by a single stationary camera and a single microphone. The novelty lies in using motion information from the entire body rather than just the face to perform these tasks, which permits processing nonfrontal views, unlike previous work. Since body movements do not exhibit instantaneous signal-level synchrony with speech, the approach targets long-term co-occurrences between audio and video subspaces. First, temporal clustering of the audio produces a large number of intermediate clusters, each containing speech from only a single speaker. Then, spatial clustering is performed in the video frames of each cluster by a novel eigen-analysis method to find the region of dominant motion. This region is associated with the speech under the assumption that a speaker exhibits more movement than the listeners. Thus, partial diarization and localization are obtained from the intermediate clusters. Speech from an intermediate cluster is modeled by a mixture of Gaussians, and the speaker's location is represented by an eigen-blob model. In the ensuing iterative clustering stage, the diarization and localization results are progressively refined by merging the closest pair of clusters and updating the models until a stop criterion is met. Ideally, each final cluster contains all the speech from a single speaker, and the corresponding eigen-blob model localizes the speaker in the image. Experiments conducted on 21 h of real data indicate that the proposed localization approach leads to a relative improvement of 40% over mutual information-based localization and that speaker diarization improves by 16% by incorporating visual information. The proposed approach does not require training and does not rely on a priori hand/face/person detection.
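To make the localization step more concrete, the sketch below illustrates one plausible reading of the eigen-analysis idea described in the abstract: within the video frames of a single intermediate (single-speaker) audio cluster, the leading eigenvector of the frame-difference covariance is reshaped into an image and thresholded to obtain a dominant-motion "blob." This is only a minimal approximation written for illustration; the function name `eigen_blob`, the use of absolute inter-frame differences, and the quantile threshold are assumptions, not the authors' exact formulation.

```python
import numpy as np

def eigen_blob(frames, energy_fraction=0.75):
    """Sketch of finding the region of dominant motion for one
    intermediate (single-speaker) cluster via eigen-analysis of
    frame differences. Illustrative only, not the paper's method.

    frames : ndarray of shape (T, H, W), grayscale frames belonging
             to the cluster's speech segments.
    Returns a boolean (H, W) mask marking the dominant-motion blob.
    """
    T, H, W = frames.shape

    # Per-pixel motion energy: absolute differences between consecutive
    # frames, flattened so each frame difference is one observation.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    X = diffs.reshape(T - 1, H * W)
    X -= X.mean(axis=0, keepdims=True)

    # The leading right singular vector is the first eigenvector of the
    # spatial covariance; it captures the dominant spatial motion pattern.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    blob = np.abs(vt[0]).reshape(H, W)

    # Keep the highest-energy pixels, assuming the speaker moves more
    # than the listeners, so the blob localizes the active speaker.
    thresh = np.quantile(blob, energy_fraction)
    return blob >= thresh
```

In the paper's pipeline, a mask like this would serve as the cluster's eigen-blob location model, which is then refined together with the Gaussian-mixture speech model as the closest clusters are iteratively merged.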