Codebook based face point trajectory synthesis algorithm using speech input
Speech Communication
In this study, a complete system that generates visual speech by synthesizing 3-D face points was implemented. The estimated face points drive MPEG-4 facial animation. The system is speaker independent and can be driven by audio alone or by audio and text together. Visual speech is synthesized with a codebook-based technique trained on audio-visual data from a single speaker. An audio-visual speech data set in Turkish was created using a 3-D facial motion capture system developed for this study. The method was evaluated in three categories. First, audio-driven results were reported and compared with the time-delayed neural network (TDNN) and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field; for this data set, TDNN performed best and RNN worst. Second, results for the codebook-based method after incorporating text information were given; text information combined with audio improves synthesis performance significantly. Third, because for many applications the donor speaker of the audio-visual data will not be available to provide audio for synthesis, a speaker-independent version of the codebook technique was designed. These speaker-independent results are important because no comparable results have been reported for animating the face model from speech input by other speakers. Although trajectory correlation degrades slightly relative to speaker-dependent synthesis (from 0.71 to 0.67), the performance remains quite satisfactory. The resulting system can thus realistically animate faces from the input speech of any Turkish speaker.
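To make the codebook idea concrete, the following is a minimal sketch of a codebook-based audio-to-visual mapping, not the paper's exact algorithm: audio feature vectors are clustered (here with a toy k-means), each codeword stores the mean face-point vector of its assigned training frames, and synthesis replaces each input audio frame with the face points of its nearest codeword. All function names, feature dimensions, and the k-means choice are assumptions for illustration.

```python
import numpy as np

def train_codebook(audio_feats, face_points, n_codes=64, n_iter=20, seed=0):
    """Toy codebook training (hypothetical sketch, not the paper's method).

    audio_feats: (frames, audio_dim) acoustic features
    face_points: (frames, visual_dim) flattened 3-D face-point coordinates
    Returns audio codeword centers and their associated mean face points.
    """
    rng = np.random.default_rng(seed)
    centers = audio_feats[rng.choice(len(audio_feats), n_codes, replace=False)].copy()
    labels = np.zeros(len(audio_feats), dtype=int)
    for _ in range(n_iter):
        # assign each audio frame to its nearest codeword (Euclidean distance)
        d = np.linalg.norm(audio_feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_codes):
            if np.any(labels == k):
                centers[k] = audio_feats[labels == k].mean(axis=0)
    # each codeword's visual entry is the mean face-point vector of its frames
    visual = np.zeros((n_codes, face_points.shape[1]))
    for k in range(n_codes):
        mask = labels == k
        if mask.any():
            visual[k] = face_points[mask].mean(axis=0)
    return centers, visual

def synthesize(audio_feats, centers, visual):
    """Map each input audio frame to the face points of its nearest codeword."""
    d = np.linalg.norm(audio_feats[:, None] - centers[None], axis=2)
    return visual[d.argmin(axis=1)]
```

In a real system the per-frame lookup would be smoothed over time (the paper synthesizes trajectories, not independent frames), and speaker independence would require speaker-robust audio features; this sketch omits both.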
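The trajectory-correlation figures quoted above (0.71 speaker-dependent vs. 0.67 speaker-independent) suggest a Pearson correlation between synthesized and ground-truth face-point trajectories, averaged over coordinates. The exact metric layout is an assumption; the sketch below computes a per-coordinate Pearson correlation over time and averages it.

```python
import numpy as np

def trajectory_correlation(pred, true):
    """Mean Pearson correlation across face-point trajectories.

    pred, true: arrays of shape (frames, n_coords), one column per
    face-point coordinate over time. Hypothetical evaluation layout,
    not necessarily the paper's exact computation.
    """
    p = pred - pred.mean(axis=0)   # center each trajectory in time
    t = true - true.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    r = num / np.where(den == 0, 1.0, den)  # guard constant trajectories
    return float(r.mean())
```

Averaging per-coordinate correlations weights every face point equally, which is a natural choice when all tracked points matter for animation quality.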