Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis

  • Authors:
  • Wesley Mattheyses, Lukas Latacz, Werner Verhelst

  • Affiliations:
  • Vrije Universiteit Brussel, Dept. ETRO-DSSP, Pleinlaan 2, B-1050 Brussel, Belgium (all authors); Werner Verhelst also with iMinds, Gaston Crommenlaan 8, Box 102, B-9050 Ghent, Belgium

  • Venue:
  • Speech Communication
  • Year:
  • 2013

Abstract

The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well established. Viseme labels are conventionally determined using a many-to-one phoneme-to-viseme mapping. However, due to visual coarticulation effects, an accurate mapping from phonemes to visemes should instead define a many-to-many scheme. In this research we found that neither standardized nor speaker-dependent many-to-one viseme labels could satisfy the quality requirements of concatenative visual speech synthesis. We therefore introduce a novel technique for defining a many-to-many phoneme-to-viseme mapping scheme, based on both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels describe the visual speech information more accurately than both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that using these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis with limited speech databases. Furthermore, the resulting synthetic visual speech was found, both objectively and subjectively, to be of higher quality when the many-to-many visemes are used to describe the speech database and the synthesis targets.
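
The abstract does not spell out implementation details, but the k-means side of the clustering approach can be sketched as follows: cluster the visual realizations of all phoneme instances in a corpus, then let each phoneme map to every cluster (viseme class) that its instances fall into. This is a minimal illustrative sketch, not the authors' actual pipeline; the feature representation, the toy data, and the `n_visemes` parameter are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one row of visual features per phoneme instance in
# the corpus (e.g. mouth-shape parameters averaged over the segment),
# plus the phoneme label of each instance. Shapes are illustrative only.
phoneme_labels = np.array(["p", "b", "a", "p", "a", "b"])
visual_features = np.random.rand(6, 12)

n_visemes = 3  # number of viseme classes; in practice chosen by validation

# Cluster ALL instances jointly: each cluster centre acts as one viseme class.
km = KMeans(n_clusters=n_visemes, n_init=10, random_state=0)
viseme_of_instance = km.fit_predict(visual_features)

# Build the many-to-many mapping: a phoneme maps to every viseme class
# that at least one of its instances was assigned to.
mapping = {}
for ph, vi in zip(phoneme_labels, viseme_of_instance):
    mapping.setdefault(ph, set()).add(int(vi))

print(mapping)  # e.g. {'p': {0, 2}, 'b': {0}, 'a': {1}}
```

Because the same phoneme can land in different clusters depending on its visual context, the resulting mapping is many-to-many rather than many-to-one, which is exactly the property the paper argues is needed to capture coarticulation.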