Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization

  • Authors:
  • F. Lavagetto

  • Affiliations:
  • Dept. of Communication, Computer and System Sciences, University of Genoa, Italy

  • Venue:
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Year:
  • 1997

Abstract

A new technology is proposed for audio-video synchronization in multimedia applications where talking human faces, either natural or synthetic, are employed for interpersonal communication services, home gaming, advanced multimodal interfaces, interactive entertainment, or movie production. Facial sequences represent an acoustic-visual source with two strongly correlated components, a talking face and the associated speech, whose synchronous presentation must be guaranteed in any multimedia application. The exact timing for displaying a video frame or for generating a synthetic facial image must therefore be supervised by some form of speech analysis, performed either as preprocessing before encoding or as postprocessing before presentation. Experimental results are reported on the use of time-delay neural networks (TDNNs) for directly estimating the visible articulation of the mouth from a coherent analysis of the acoustic speech. The architectural solution of employing a bank of independent single-output TDNNs is compared with the alternative of using a single multi-output TDNN. Similarly, two learning procedures for training the TDNNs are applied and compared: the first based on the classic mean square error (MSE) and the second on a measure of cross-correlation (CC). The results demonstrate the superiority of the system based on multiple single-output TDNNs, as well as the improvements in both convergence speed and estimation fidelity achievable with the cross-correlation-based learning algorithm.
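To make the two comparisons concrete, below is a minimal PyTorch sketch of the design space the abstract describes: a bank of independent single-output TDNNs versus one multi-output TDNN, trained with either MSE or a cross-correlation-based loss. The TDNN is modeled here as stacked 1D convolutions over acoustic feature frames. All layer widths, the 12-dimensional acoustic features, the three lip parameters, and the exact CC formulation are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the paper's two comparisons: TDNN bank vs. multi-output TDNN,
# and MSE vs. cross-correlation (CC) training criteria.
import torch
import torch.nn as nn


class TDNN(nn.Module):
    """Time-delay network: stacked 1D convolutions over acoustic frames."""

    def __init__(self, in_dim: int = 12, hidden: int = 16, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            # kernel_size=5 gives each unit a 5-frame delay window (assumed size)
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv1d(hidden, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, frames) -> (batch, out_dim, frames)
        return self.net(x)


def correlation_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation between predicted and target trajectories.

    One plausible reading of the paper's CC criterion; the abstract does
    not give the exact formulation used by the author.
    """
    p = pred - pred.mean(dim=-1, keepdim=True)
    t = target - target.mean(dim=-1, keepdim=True)
    cc = (p * t).sum(-1) / (p.norm(dim=-1) * t.norm(dim=-1) + 1e-8)
    return (1.0 - cc).mean()


# Bank of single-output TDNNs, one per mouth-articulation parameter
# (e.g. mouth width, height, lip protrusion -- hypothetical labels):
n_params = 3
bank = nn.ModuleList(TDNN(out_dim=1) for _ in range(n_params))
multi = TDNN(out_dim=n_params)  # the single multi-output alternative

speech = torch.randn(8, 12, 100)         # batch of 100-frame acoustic sequences
target = torch.randn(8, n_params, 100)   # aligned lip-parameter trajectories

pred_bank = torch.cat([m(speech) for m in bank], dim=1)
loss_mse = nn.functional.mse_loss(pred_bank, target)
loss_cc = correlation_loss(pred_bank, target)
```

Training each single-output TDNN independently lets every network specialize on one articulatory trajectory, which is one plausible explanation for the superiority the paper reports over a single shared multi-output network.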