Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation

  • Authors:
  • Amar A. El-Sallam; Ajmal S. Mian

  • Affiliations:
  • School of Electrical, Electronic and Computer Engineering; School of Computer Science and Software Engineering, The University of Western Australia, Australia 6009

  • Venue:
  • ICIAR '09: Proceedings of the 6th International Conference on Image Analysis and Recognition

  • Year:
  • 2009

Abstract

In this paper, we propose a novel correlation-based method for speech-video synchronization (sync) and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movements. First, a nonlinear, time-varying model is used to represent the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals, where each AM-FM component models a single speech formant. Using a Taylor series expansion, the model is formulated so that it characterizes the relation of the amplitude and instantaneous frequency of each AM-FM signal to the lip movements. Second, the envelope of the speech signal is estimated and correlated with signals derived from the lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real recordings, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source, and (ii) estimate delays between the audio and video signals as small as 0.1 seconds when the speech is noisy and 0.04 seconds when the additive noise is less significant.
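For reference, the AM-FM representation the abstract describes is commonly written as below. This is the standard formant-resonance formulation; the paper's exact model (and its Taylor-series treatment) may include additional terms:

```latex
% Speech as a sum of K AM-FM formant components, where a_k(t) is the
% time-varying amplitude (envelope) of formant k, \omega_k(t) its
% instantaneous frequency, and \phi_k a phase offset:
s(t) = \sum_{k=1}^{K} a_k(t)\,\cos\!\left( \int_{0}^{t} \omega_k(\tau)\, d\tau + \phi_k \right)
```

The envelope-correlation and delay-estimation step can also be sketched compactly. The Python sketch below assumes a Hilbert-transform envelope and a one-dimensional lip-movement measurement per video frame (e.g., mouth-opening height); the function name, signal names, and the resampling choice are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np
from scipy.signal import hilbert, correlate, correlation_lags

def estimate_av_delay(speech, lip_signal, fs_audio, fs_video):
    """Estimate the audio-video delay by correlating the speech
    envelope with a lip-movement signal (illustrative sketch)."""
    # Speech envelope as the magnitude of the analytic signal.
    envelope = np.abs(hilbert(speech))

    # Resample the envelope onto the video time base so both
    # signals have one sample per video frame.
    t_video = np.arange(len(lip_signal)) / fs_video
    t_audio = np.arange(len(envelope)) / fs_audio
    env_v = np.interp(t_video, t_audio, envelope)

    # Zero-mean both signals before cross-correlating.
    env_v = env_v - env_v.mean()
    lip = lip_signal - lip_signal.mean()

    # Full cross-correlation; the lag of the strongest peak is the
    # delay estimate, in video frames.
    xcorr = correlate(env_v, lip, mode="full")
    lags = correlation_lags(len(env_v), len(lip), mode="full")
    peak_lag = lags[np.argmax(np.abs(xcorr))]
    delay_sec = peak_lag / fs_video

    # A normalized correlation peak can serve as a same-source
    # score for the classification step (threshold set empirically).
    norm = np.linalg.norm(env_v) * np.linalg.norm(lip)
    score = np.abs(xcorr).max() / norm if norm > 0 else 0.0
    return delay_sec, score
```

Note that the delay resolution of this sketch is one video frame, i.e. 0.04 seconds at 25 fps, which is consistent with the finest delay the abstract reports under low noise.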