Detection of a speaker in video by combined analysis of speech sound and mouth movement

  • Authors:
  • Osamu Ikeda

  • Affiliations:
  • Faculty of Engineering, Takushoku University, Hachioji, Tokyo, Japan

  • Venue:
  • ISVC'07: Proceedings of the 3rd International Conference on Advances in Visual Computing - Volume Part II
  • Year:
  • 2007

Abstract

We present a robust method for detecting and locating a speaker through joint analysis of speech sound and video images. First, a short segment of the speech sound is analyzed to estimate the rate of spoken syllables, and a difference image is formed using the optimal frame distance derived from that rate to detect mouth candidates. The candidates are then tracked to verify that one of them is the mouth: the rate of mouth movement is estimated from the brightness change profile of the first candidate and, if the two rates agree, the three brightest regions in the resulting difference image are detected as the mouth and eyes. If not, the second candidate is tracked, and so on. The first-order moment of the power spectrum of the brightness change profile and the lateral shifts observed during tracking are also used to check whether the candidates are facial parts.
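
The abstract gives no implementation details, so the following Python sketch is only a rough illustration of the ideas it describes, not the authors' method. The frame-distance heuristic (half a syllable period), the helper names (frame_distance, difference_image, brightness_profile, movement_rate), and the candidate-box format are all assumptions made for this sketch.

    import numpy as np

    def frame_distance(syllable_rate_hz, fps):
        # Assumed heuristic: pick a frame distance of about half the
        # syllable period, so the two frames straddle a mouth
        # open/close transition (one syllable ~ one open/close cycle).
        return max(1, int(round(fps / (2.0 * syllable_rate_hz))))

    def difference_image(frames, d):
        # Accumulate absolute frame differences at distance d; moving
        # regions such as the mouth appear bright in the result.
        acc = np.zeros_like(frames[0], dtype=np.float64)
        for t in range(len(frames) - d):
            acc += np.abs(frames[t + d].astype(np.float64)
                          - frames[t].astype(np.float64))
        return acc / (len(frames) - d)

    def brightness_profile(frames, box):
        # Mean brightness of a candidate region (y0, y1, x0, x1)
        # in each frame, giving a brightness change profile over time.
        y0, y1, x0, x1 = box
        return np.array([f[y0:y1, x0:x1].mean() for f in frames])

    def movement_rate(profile, fps):
        # Estimate the dominant movement rate (Hz) as the first-order
        # moment of the power spectrum of the brightness change profile.
        change = np.diff(profile)
        spec = np.abs(np.fft.rfft(change - change.mean())) ** 2
        freqs = np.fft.rfftfreq(len(change), d=1.0 / fps)
        return (freqs * spec).sum() / spec.sum()

A candidate would then be accepted when its estimated movement rate matches the syllable rate from the audio to within some tolerance, e.g. abs(movement_rate(p, fps) - syllable_rate_hz) < tol; the tolerance is likewise an assumed parameter not specified in the abstract.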