Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation

  • Authors:
  • Amar A. El-Sallam; Ajmal S. Mian

  • Affiliations:
  • School of Electrical, Electronic and Computer Engineering; School of Computer Science and Software Engineering, The University of Western Australia, Australia 6009

  • Venue:
  • ICIAR '09: Proceedings of the 6th International Conference on Image Analysis and Recognition

  • Year:
  • 2009

Abstract

In this paper, we propose a novel correlation-based method for speech-video synchronization (sync) and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movements. First, a nonlinear, time-varying model is used to represent the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals, where each AM-FM component models a single speech formant. Using a Taylor series expansion, the model is formulated so that it characterizes the relation of the amplitude and instantaneous frequency of each AM-FM signal to the lip movements. Second, the envelope of the speech signal is estimated and correlated with signals derived from the lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real recordings, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source, and (ii) estimate delays between the audio and video signals as small as 0.1 seconds when the speech is noisy and 0.04 seconds when the additive noise is less significant.
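For reference, the AM-FM representation the abstract describes is commonly written as below. This is the standard formant-resonance formulation; the paper's exact model (and its Taylor-series treatment) may include additional terms:

```latex
% Speech as a sum of K AM-FM formant components, where a_k(t) is the
% time-varying amplitude (envelope) of formant k, \omega_k(t) its
% instantaneous frequency, and \phi_k a phase offset:
s(t) = \sum_{k=1}^{K} a_k(t)\,\cos\!\left( \int_{0}^{t} \omega_k(\tau)\, d\tau + \phi_k \right)
```

The envelope-correlation and delay-estimation step can also be sketched compactly. The Python sketch below assumes a Hilbert-transform envelope and a one-dimensional lip-movement measurement per video frame (e.g., mouth-opening height); the function name, signal names, and the resampling choice are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np
from scipy.signal import hilbert, correlate, correlation_lags

def estimate_av_delay(speech, lip_signal, fs_audio, fs_video):
    """Estimate the audio-video delay by correlating the speech
    envelope with a lip-movement signal (illustrative sketch)."""
    # Speech envelope as the magnitude of the analytic signal.
    envelope = np.abs(hilbert(speech))

    # Resample the envelope onto the video time base so both
    # signals have one sample per video frame.
    t_video = np.arange(len(lip_signal)) / fs_video
    t_audio = np.arange(len(envelope)) / fs_audio
    env_v = np.interp(t_video, t_audio, envelope)

    # Zero-mean both signals before cross-correlating.
    env_v = env_v - env_v.mean()
    lip = lip_signal - lip_signal.mean()

    # Full cross-correlation; the lag of the strongest peak is the
    # delay estimate, in video frames.
    xcorr = correlate(env_v, lip, mode="full")
    lags = correlation_lags(len(env_v), len(lip), mode="full")
    peak_lag = lags[np.argmax(np.abs(xcorr))]
    delay_sec = peak_lag / fs_video

    # A normalized correlation peak can serve as a same-source
    # score for the classification step (threshold set empirically).
    norm = np.linalg.norm(env_v) * np.linalg.norm(lip)
    score = np.abs(xcorr).max() / norm if norm > 0 else 0.0
    return delay_sec, score
```

Note that the delay resolution of this sketch is one video frame, i.e. 0.04 seconds at 25 fps, which is consistent with the finest delay the abstract reports under low noise.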