This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model treats the whole utterance contained in a video as a continuous process and represents it with a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions into the image space is found through graph embedding. Such a model allows mouth images to be synthesized at arbitrary positions within the utterance. To synthesize a video for a novel utterance, the utterance is first compared against the recorded ones to find the phoneme combinations that best approximate it. Based on the learned models, dense videos are synthesized, concatenated, and downsampled; a new generative model is then built on the remaining image samples for the final video synthesis.
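The pipeline above can be sketched in code. The following is a minimal illustration under two assumptions the abstract does not pin down: that the trigonometric functions are the cosine eigenvectors of the path-graph Laplacian (their standard closed form), and that the graph-embedding projection into image space reduces to a linear least-squares fit. All function names and the toy data are hypothetical, not the authors' implementation.

```python
import numpy as np

def path_graph_basis(T, k, positions=None):
    """Cosine eigenbasis of a length-T path graph's Laplacian.

    Column j samples cos(pi * j * (t + 0.5) / T); the basis can be
    evaluated at fractional positions t, i.e. anywhere in the utterance.
    """
    if positions is None:
        positions = np.arange(T)
    positions = np.asarray(positions, dtype=float)
    return np.cos(np.pi * np.outer(positions + 0.5, np.arange(k)) / T)

def fit_generative_model(frames, k):
    """Fit a linear map W from the trigonometric embedding to image space.

    frames: (T, D) array, one vectorized mouth image per row.
    Returns W of shape (k, D) minimizing ||Y W - frames||^2.
    """
    T = frames.shape[0]
    Y = path_graph_basis(T, k)                      # (T, k) embedding coordinates
    W, *_ = np.linalg.lstsq(Y, frames, rcond=None)  # least-squares projection
    return W, T

def synthesize(W, T, positions):
    """Render mouth images at arbitrary (continuous) positions in the utterance."""
    Y = path_graph_basis(T, W.shape[0], positions)
    return Y @ W

# Toy usage: 40 recorded frames of 64x64 mouths, resampled to 80 frames.
frames = np.random.rand(40, 64 * 64)               # stand-in for real video frames
W, T = fit_generative_model(frames, k=10)
dense = synthesize(W, T, np.linspace(0.0, T - 1.0, 80))
```

Evaluating the basis at fractional positions is what lets such a model render mouth images between recorded frames, which is how dense videos can be synthesized for a novel utterance and then downsampled before the final model is built.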