Trainable videorealistic speech animation

Authors:
Tony Ezzat;Gadi Geiger;Tomaso Poggio
Affiliations:
Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA;Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA;Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA
Venue:
FGR '04 Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition
Year:
2004

Abstract

We describe how to use machine learning techniques to create a generative, videorealistic speech animation module. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; it can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence that contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.
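
The abstract requires only that the run-time audio be phonetically aligned. As a rough illustration of what such an input might look like, the sketch below turns an alignment into one phoneme label per video frame. The alignment format (label, start, end in seconds), the function name `phonemes_per_frame`, and the frame rates are all assumptions made for this example; none of them comes from the paper, and this is not the authors' synthesis method.

```python
from typing import List, Tuple

# Hypothetical alignment format: (phoneme label, start time in s, end time in s).
# Segments are assumed to be contiguous and sorted by start time.
Alignment = List[Tuple[str, float, float]]


def phonemes_per_frame(alignment: Alignment, fps: float = 30.0) -> List[str]:
    """Return one phoneme label per video frame, sampled at each frame's start time."""
    if not alignment:
        return []
    total_duration = alignment[-1][2]
    n_frames = int(round(total_duration * fps))
    labels: List[str] = []
    idx = 0
    for frame in range(n_frames):
        t = frame / fps
        # Advance to the segment that contains time t (the last segment absorbs any remainder).
        while idx < len(alignment) - 1 and t >= alignment[idx][2]:
            idx += 1
        labels.append(alignment[idx][0])
    return labels


# Toy alignment for illustration only; the labels and times are made up.
example = [("sil", 0.0, 0.1), ("h", 0.1, 0.2), ("ay", 0.2, 0.4)]
frames = phonemes_per_frame(example, fps=10.0)
print(frames)
```

A per-frame label sequence of this kind is one plausible way to hand timing information to a visual speech module, whether the audio was recorded or produced by a text-to-speech system.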