This paper presents a realistic visual speech synthesis method based on hybrid concatenation. Unlike previous approaches built on phoneme-level unit selection or hidden Markov models (HMMs), the hybrid concatenation method combines frame-level unit selection with a fused HMM, and generates more expressive and stable facial animations. The fused HMM explicitly models the loose synchronization of tightly coupled audio and visual streams, yielding much better audiovisual mapping than a conventional HMM. Once the fused HMM is trained, facial animation is generated by unit selection at the frame level, driven by the fused HMM output probabilities. To make unit selection efficient on a large corpus, the paper also proposes a two-layer Viterbi search in which only the subsets selected in the first layer are examined further in the second layer; with this pruning, the system has been successfully integrated into real-time applications. Furthermore, the paper proposes a mapping method based on Gaussian mixture models (GMMs) that generates emotional facial expressions from neutral ones. Experiments show that the method synthesizes facial parameters with high quality, and that it outperforms other audiovisual mapping methods in expressiveness, stability, and running speed.
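The two-layer search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the corpus units, the quadratic emission score standing in for the fused-HMM output log-probability, the quadratic transition (smoothness) score, and the centroid-based first-layer pruning are all simplifying assumptions. The first layer ranks cluster centroids per frame and keeps only the top-scoring subsets; the second layer runs a full frame-level Viterbi restricted to units from those subsets.

```python
def emit_logp(frame, unit):
    # Proxy for the fused-HMM output log-probability of a corpus unit
    # given the observed frame (assumption: negative squared distance).
    return -(frame - unit) ** 2

def trans_logp(u1, u2):
    # Smoothness score between consecutive units (assumed quadratic).
    return -0.5 * (u2 - u1) ** 2

def viterbi(frames, cands):
    """Frame-level Viterbi over per-frame candidate unit lists."""
    # prev maps unit -> (best path score ending here, backpointer)
    prev = {u: (emit_logp(frames[0], u), None) for u in cands[0]}
    hist = [prev]
    for t in range(1, len(frames)):
        cur = {}
        for u in cands[t]:
            p, (score, _) = max(prev.items(),
                                key=lambda kv: kv[1][0] + trans_logp(kv[0], u))
            cur[u] = (score + trans_logp(p, u) + emit_logp(frames[t], u), p)
        hist.append(cur)
        prev = cur
    last = max(prev, key=lambda u: prev[u][0])
    path = [last]
    for t in range(len(frames) - 1, 0, -1):
        path.append(hist[t][path[-1]][1])
    return path[::-1]

def two_layer_search(frames, clusters, top_k=2):
    """Layer 1: rank cluster centroids per frame and keep the top_k
    subsets; layer 2: full Viterbi over units from those subsets only."""
    cands = []
    for f in frames:
        ranked = sorted(clusters,
                        key=lambda c: abs(f - sum(clusters[c]) / len(clusters[c])))
        cands.append([u for c in ranked[:top_k] for u in clusters[c]])
    return viterbi(frames, cands)
```

With `top_k` small, the second layer scores only a fraction of the corpus per frame, which is what makes the frame-level search tractable in real time; the paper's actual scoring uses fused-HMM probabilities rather than these toy distances.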