This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model treats the whole utterance contained in a video as a continuous process and represents it with a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions into the image space is found through graph embedding. Such a model allows mouth images to be synthesized at arbitrary positions within the utterance. To synthesize a video for a novel utterance, the utterance is first compared against the recorded ones to find the phoneme combinations that best approximate it. Based on the learned models, dense videos are synthesized, concatenated, and downsampled; a new generative model is then built on the remaining image samples for the final video synthesis.
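The pipeline above can be sketched in code. The following is a minimal illustration under two assumptions the abstract does not pin down: that the trigonometric functions are the cosine eigenvectors of the path-graph Laplacian (their standard closed form), and that the graph-embedding projection into image space reduces to a linear least-squares fit. All function names and the toy data are hypothetical, not the authors' implementation.

```python
import numpy as np

def path_graph_basis(T, k, positions=None):
    """Cosine eigenbasis of a length-T path graph's Laplacian.

    Column j samples cos(pi * j * (t + 0.5) / T); the basis can be
    evaluated at fractional positions t, i.e. anywhere in the utterance.
    """
    if positions is None:
        positions = np.arange(T)
    positions = np.asarray(positions, dtype=float)
    return np.cos(np.pi * np.outer(positions + 0.5, np.arange(k)) / T)

def fit_generative_model(frames, k):
    """Fit a linear map W from the trigonometric embedding to image space.

    frames: (T, D) array, one vectorized mouth image per row.
    Returns W of shape (k, D) minimizing ||Y W - frames||^2.
    """
    T = frames.shape[0]
    Y = path_graph_basis(T, k)                      # (T, k) embedding coordinates
    W, *_ = np.linalg.lstsq(Y, frames, rcond=None)  # least-squares projection
    return W, T

def synthesize(W, T, positions):
    """Render mouth images at arbitrary (continuous) positions in the utterance."""
    Y = path_graph_basis(T, W.shape[0], positions)
    return Y @ W

# Toy usage: 40 recorded frames of 64x64 mouths, resampled to 80 frames.
frames = np.random.rand(40, 64 * 64)               # stand-in for real video frames
W, T = fit_generative_model(frames, k=10)
dense = synthesize(W, T, np.linspace(0.0, T - 1.0, 80))
```

Evaluating the basis at fractional positions is what lets such a model render mouth images between recorded frames, which is how dense videos can be synthesized for a novel utterance and then downsampled before the final model is built.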