We present MikeTalk, a text-to-audiovisual speech synthesizer that converts input text into an audiovisual speech stream. MikeTalk is built from visemes, a small set of images spanning the range of mouth shapes that occur in speech. The visemes are acquired from a recorded visual corpus of a human subject, designed specifically to elicit one instantiation of each viseme. Using optical flow methods, correspondences from every viseme to every other viseme are computed automatically. By morphing along such a correspondence, a smooth transition between two viseme images can be generated, and a complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer determines which viseme transitions to use and the rate at which the morphing proceeds. In this manner the visual speech stream is synchronized with the audio speech stream, giving the impression of a photorealistic talking face.
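The morphing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a precomputed dense flow field mapping pixels of the first viseme image to the second (as an optical flow method would produce), and it approximates the forward warp with a simpler backward warp. The helper names `bilinear_sample` and `morph_frame` are introduced here for illustration.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample a grayscale image at fractional (xs, ys) coordinates
    with bilinear interpolation, clamping at the borders."""
    h, w = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx = np.clip(xs - x0, 0.0, 1.0)
    dy = np.clip(ys - y0, 0.0, 1.0)
    return (img[y0, x0] * (1 - dx) * (1 - dy)
            + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy
            + img[y0 + 1, x0 + 1] * dx * dy)

def morph_frame(im0, im1, flow, alpha):
    """One intermediate frame of a viseme transition at alpha in [0, 1].

    flow[..., 0] / flow[..., 1] hold the x / y displacement that carries
    each pixel of im0 to its corresponding pixel in im1.  Both images are
    warped part-way along the flow and then cross-dissolved; a backward
    warp stands in for the true forward warp, a common simplification.
    """
    h, w = im0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    warped0 = bilinear_sample(im0, xs - alpha * flow[..., 0],
                                   ys - alpha * flow[..., 1])
    warped1 = bilinear_sample(im1, xs + (1 - alpha) * flow[..., 0],
                                   ys + (1 - alpha) * flow[..., 1])
    return (1 - alpha) * warped0 + alpha * warped1
```

A full transition is then just `[morph_frame(im0, im1, flow, a) for a in np.linspace(0, 1, n)]`, with `n` chosen from the phoneme durations reported by the text-to-speech synthesizer; at `alpha = 0` the frame reduces to the first viseme image and at `alpha = 1` to the second.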