Speech-to-Lip Movement Synthesis by Maximizing Audio-Visual Joint Probability Based on the EM Algorithm

  • Authors:
  • Satoshi Nakamura; Eli Yamamoto

  • Affiliations:
  • ATR Spoken Language Translation Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan; Faculty of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama 640-8510, Japan

  • Venue:
  • Journal of VLSI Signal Processing Systems - Special issue on multimedia signal processing
  • Year:
  • 2001

Abstract

In this paper, we investigate a Hidden Markov Model (HMM)-based method for driving a lip movement sequence from input speech. In a previous study, we investigated a mapping method based on the Viterbi decoding algorithm, which converts an input speech signal into a lip movement sequence through the most likely state sequence of audio HMMs. However, that method can produce errors caused by incorrectly decoded HMM states. This paper proposes a method that re-estimates the visual parameters by maximizing the audio-visual joint probability of the HMMs with the Expectation-Maximization (EM) algorithm. In the experiments, the proposed mapping method yields a 26% error reduction over the Viterbi-based algorithm for bilabial consonants that were incorrectly decoded.
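To make the contrast between the two mappings concrete, below is a minimal NumPy sketch, not the paper's exact formulation: a hard Viterbi-based mapping that outputs the visual mean of the single best HMM state per frame, versus a soft mapping that blends the visual means of all states using the state occupancy probabilities from a forward-backward pass, which is the kind of posterior-weighted estimate an EM-style re-estimation of the visual parameters relies on. All model quantities (initial probabilities, transition matrix, per-state audio and visual Gaussian parameters) and the function names are illustrative assumptions, taken to come from a previously trained audio-visual HMM.

```python
import numpy as np


def log_gauss(x, mean, var):
    """Diagonal-covariance Gaussian log-likelihood of one frame against all states."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)


def viterbi_map(audio, pi, A, a_mean, a_var, v_mean):
    """Baseline: visual parameters taken from the single most likely state path."""
    T, N = len(audio), len(pi)
    logB = np.array([log_gauss(audio[t], a_mean, a_var) for t in range(T)])  # (T, N)
    delta = np.log(pi) + logB[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)            # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                     # backtrack the best state sequence
        path[t] = psi[t + 1, path[t + 1]]
    return v_mean[path]                                # (T, visual_dim)


def posterior_map(audio, pi, A, a_mean, a_var, v_mean):
    """Soft mapping: visual parameters as posterior-weighted sums of state visual means."""
    T, N = len(audio), len(pi)
    B = np.array([np.exp(log_gauss(audio[t], a_mean, a_var)) for t in range(T)])
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()                     # scale to avoid numerical underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # state occupancy per frame
    return gamma @ v_mean                              # (T, visual_dim)
```

The soft mapping degrades gracefully at frames where the decoded state is wrong: instead of committing to one (possibly incorrect) state's visual mean, the output is averaged over all states in proportion to how well each explains the audio frame.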