Codebook based face point trajectory synthesis algorithm using speech input
Speech Communication
In this study, a complete system that generates visual speech by synthesizing 3-D face points was implemented. The estimated face points drive MPEG-4 facial animation. The system is speaker independent and can be driven by audio alone or by audio and text together. Visual speech is synthesized with a codebook-based technique trained on audio-visual data from a single speaker. An audio-visual speech data set in Turkish was created using a 3-D facial motion capture system developed for this study. The method was evaluated in three categories. First, audio-driven results were reported and compared with the time-delayed neural network (TDNN) and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field; for this data set, TDNN performed best and RNN worst. Second, results for the codebook-based method after incorporating text information were given; text information combined with audio improves synthesis performance significantly. Third, because for many applications the donor speaker of the audio-visual data will not be available to provide audio for synthesis, a speaker-independent version of the codebook technique was designed. These speaker-independent results are important because no comparable results have been reported for animating the face model from speech input by other speakers. Although trajectory correlation degrades slightly relative to speaker-dependent synthesis (from 0.71 to 0.67), the performance remains quite satisfactory. The resulting system can thus realistically animate faces from the input speech of any Turkish speaker.
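To make the codebook idea concrete, the following is a minimal sketch of a codebook-based audio-to-visual mapping, not the paper's exact algorithm: audio feature vectors are clustered (here with a toy k-means), each codeword stores the mean face-point vector of its assigned training frames, and synthesis replaces each input audio frame with the face points of its nearest codeword. All function names, feature dimensions, and the k-means choice are assumptions for illustration.

```python
import numpy as np

def train_codebook(audio_feats, face_points, n_codes=64, n_iter=20, seed=0):
    """Toy codebook training (hypothetical sketch, not the paper's method).

    audio_feats: (frames, audio_dim) acoustic features
    face_points: (frames, visual_dim) flattened 3-D face-point coordinates
    Returns audio codeword centers and their associated mean face points.
    """
    rng = np.random.default_rng(seed)
    centers = audio_feats[rng.choice(len(audio_feats), n_codes, replace=False)].copy()
    labels = np.zeros(len(audio_feats), dtype=int)
    for _ in range(n_iter):
        # assign each audio frame to its nearest codeword (Euclidean distance)
        d = np.linalg.norm(audio_feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_codes):
            if np.any(labels == k):
                centers[k] = audio_feats[labels == k].mean(axis=0)
    # each codeword's visual entry is the mean face-point vector of its frames
    visual = np.zeros((n_codes, face_points.shape[1]))
    for k in range(n_codes):
        mask = labels == k
        if mask.any():
            visual[k] = face_points[mask].mean(axis=0)
    return centers, visual

def synthesize(audio_feats, centers, visual):
    """Map each input audio frame to the face points of its nearest codeword."""
    d = np.linalg.norm(audio_feats[:, None] - centers[None], axis=2)
    return visual[d.argmin(axis=1)]
```

In a real system the per-frame lookup would be smoothed over time (the paper synthesizes trajectories, not independent frames), and speaker independence would require speaker-robust audio features; this sketch omits both.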
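The trajectory-correlation figures quoted above (0.71 speaker-dependent vs. 0.67 speaker-independent) suggest a Pearson correlation between synthesized and ground-truth face-point trajectories, averaged over coordinates. The exact metric layout is an assumption; the sketch below computes a per-coordinate Pearson correlation over time and averages it.

```python
import numpy as np

def trajectory_correlation(pred, true):
    """Mean Pearson correlation across face-point trajectories.

    pred, true: arrays of shape (frames, n_coords), one column per
    face-point coordinate over time. Hypothetical evaluation layout,
    not necessarily the paper's exact computation.
    """
    p = pred - pred.mean(axis=0)   # center each trajectory in time
    t = true - true.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    r = num / np.where(den == 0, 1.0, den)  # guard constant trajectories
    return float(r.mean())
```

Averaging per-coordinate correlations weights every face point equally, which is a natural choice when all tracked points matter for animation quality.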