Text-to-Audiovisual Speech Synthesizer
VW '00 Proceedings of the Second International Conference on Virtual Worlds
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
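The morphing step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a precomputed dense optical-flow field `flow_ab` from viseme A to viseme B, uses nearest-neighbor inverse warping for simplicity, and blends the two warped images by the morph parameter `alpha` (0 at viseme A, 1 at viseme B). The function names are hypothetical.

```python
import numpy as np

def warp(img, flow, alpha):
    """Inverse-warp a grayscale image by a fraction alpha of a flow field.
    flow[y, x] = (dy, dx) gives the pixel motion toward the target viseme.
    Nearest-neighbor sampling is used purely to keep the sketch short."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each output pixel samples from the point alpha * flow behind it.
    src_y = np.clip(np.rint(ys - alpha * flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs - alpha * flow[..., 1]).astype(int), 0, w - 1)
    return img[src_y, src_x]

def morph(img_a, img_b, flow_ab, alpha):
    """Cross-dissolve two visemes along their flow correspondence:
    warp A forward by alpha, warp B backward by (1 - alpha), then blend."""
    a = warp(img_a, flow_ab, alpha)           # A moved part-way toward B
    b = warp(img_b, -flow_ab, 1.0 - alpha)    # B moved back toward A
    return (1.0 - alpha) * a + alpha * b
```

Sweeping `alpha` from 0 to 1 and rendering each intermediate frame yields one viseme transition; a full utterance is then a concatenation of such transitions, with the sweep rate set by the phoneme timings from the text-to-speech synthesizer.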