Speech-to-Lip Movement Synthesis by Maximizing Audio-Visual Joint Probability Based on the EM Algorithm

  • Authors:
  • Satoshi Nakamura; Eli Yamamoto

  • Affiliations:
  • ATR Spoken Language Translation Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan; Faculty of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama 640-8510, Japan

  • Venue:
  • Journal of VLSI Signal Processing Systems - Special issue on multimedia signal processing
  • Year:
  • 2001

Abstract

In this paper, we investigate a Hidden Markov Model (HMM)-based method for driving a lip movement sequence from input speech. In a previous study, we investigated a mapping method based on the Viterbi decoding algorithm, which converts an input speech signal into a lip movement sequence through the most likely state sequence of audio HMMs. However, that method can produce errors caused by incorrectly decoded HMM states. This paper proposes a method that re-estimates the visual parameters by maximizing the audio-visual joint probability of the HMMs with the Expectation-Maximization (EM) algorithm. In the experiments, the proposed mapping method yields a 26% error reduction over the Viterbi-based algorithm for bilabial consonants that were incorrectly decoded.
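To make the contrast between the two mappings concrete, below is a minimal NumPy sketch, not the paper's exact formulation: a hard Viterbi-based mapping that outputs the visual mean of the single best HMM state per frame, versus a soft mapping that blends the visual means of all states using the state occupancy probabilities from a forward-backward pass, which is the kind of posterior-weighted estimate an EM-style re-estimation of the visual parameters relies on. All model quantities (initial probabilities, transition matrix, per-state audio and visual Gaussian parameters) and the function names are illustrative assumptions, taken to come from a previously trained audio-visual HMM.

```python
import numpy as np


def log_gauss(x, mean, var):
    """Diagonal-covariance Gaussian log-likelihood of one frame against all states."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)


def viterbi_map(audio, pi, A, a_mean, a_var, v_mean):
    """Baseline: visual parameters taken from the single most likely state path."""
    T, N = len(audio), len(pi)
    logB = np.array([log_gauss(audio[t], a_mean, a_var) for t in range(T)])  # (T, N)
    delta = np.log(pi) + logB[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)            # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                     # backtrack the best state sequence
        path[t] = psi[t + 1, path[t + 1]]
    return v_mean[path]                                # (T, visual_dim)


def posterior_map(audio, pi, A, a_mean, a_var, v_mean):
    """Soft mapping: visual parameters as posterior-weighted sums of state visual means."""
    T, N = len(audio), len(pi)
    B = np.array([np.exp(log_gauss(audio[t], a_mean, a_var)) for t in range(T)])
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()                     # scale to avoid numerical underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # state occupancy per frame
    return gamma @ v_mean                              # (T, visual_dim)
```

The soft mapping degrades gracefully at frames where the decoded state is wrong: instead of committing to one (possibly incorrect) state's visual mean, the output is averaged over all states in proportion to how well each explains the audio frame.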