Correlation based speech-video synchronization

Authors:
Amar A. EL-Sallam;Ajmal S. Mian
Affiliations:
School of Electrical, Electronic and Computer Engineering, The University of Western Australia, 35 Stirling Highway Crawley, WA 6009, Australia and School of Computer Science and Software Engineer ...;School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway Crawley, WA 6009, Australia
Venue:
Pattern Recognition Letters
Year:
2011

Abstract

This paper presents a novel lip synchronization technique that investigates the correlation between speech and lip movements. First, the speech signal is represented by a nonlinear time-varying model consisting of a sum of AM-FM signals, each of which models a single formant frequency. The model is realized using a Taylor series expansion in a way that relates the lip shape (width and height) to the speech amplitude and instantaneous frequency. Using the lip width and height, a semi-speech signal is generated and correlated with the original speech signal over a span of delays, and the delay between the speech and the video is then estimated. Using real and noisy data from the VidTIMIT and in-house databases, the proposed method was able to estimate small delays of 0.01-0.1 s for noiseless and noisy signals, respectively, with a maximum absolute error of 0.0022 s.
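The final step of the abstract, correlating two signals over a span of candidate delays and picking the best match, can be illustrated with a minimal sketch. This is not the authors' implementation: the mapping from lip width and height to a semi-speech signal is not reproduced here, so a noisy, shifted copy of a synthetic signal stands in for it. The function names, the sampling rate `fs`, and the search window `max_delay_s` are all assumptions made for illustration.

```python
import math
import random


def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    num = sum(p * q for p, q in zip(da, db))
    den = math.sqrt(sum(p * p for p in da) * sum(q * q for q in db))
    return num / den if den > 0 else 0.0


def estimate_delay(reference, delayed, fs, max_delay_s):
    """Return the delay (in seconds) of `delayed` relative to `reference`.

    Every integer lag within +/- max_delay_s is tried, and the lag with the
    highest normalized correlation wins. Assumes the search window is much
    shorter than the signals themselves.
    """
    n = min(len(reference), len(delayed))
    max_lag = int(round(max_delay_s * fs))
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = reference[:n - lag], delayed[lag:n]
        else:
            a, b = reference[-lag:n], delayed[:n + lag]
        score = normalized_correlation(a, b)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs


# Synthetic stand-in data (hypothetical, not from VidTIMIT).
random.seed(0)
fs = 1000                       # assumed common sampling rate in Hz
speech = [random.gauss(0.0, 1.0) for _ in range(1000)]
shift = 50                      # true delay of 50 samples = 0.05 s
semi_speech = [0.0] * shift + [s + random.gauss(0.0, 0.1) for s in speech[:-shift]]

delay = estimate_delay(speech, semi_speech, fs, max_delay_s=0.1)
print(f"estimated delay: {delay:.3f} s")
```

With this grid search the resolution is one sample period (1 ms at the assumed 1 kHz rate), so resolving delays more finely would require a higher common sampling rate or interpolation around the correlation peak.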