On the importance of audiovisual coherence for the perceived quality of synthesized visual speech

Authors:
Wesley Mattheyses;Lukas Latacz;Werner Verhelst
Affiliations:
Department of ETRO-DSSP, Vrije Universiteit Brussel, Brussels, Belgium;Department of ETRO-DSSP, Vrije Universiteit Brussel, Brussels, Belgium;Department of ETRO-DSSP, Vrije Universiteit Brussel, Brussels, Belgium
Venue:
EURASIP Journal on Audio, Speech, and Music Processing - Special issue on animating virtual speakers or singers from audio: Lip-synching facial animation
Year:
2009

Abstract

Audiovisual text-to-speech systems convert written text into an audiovisual speech signal. Typically, the visual mode of the synthetic speech is synthesized separately from the audio, the latter being either natural or synthesized speech. However, how viewers perceive mismatches between these two information streams requires experimental exploration, since such mismatches could degrade the quality of the output. To increase the intermodal coherence in synthetic 2D photorealistic speech, we extended the well-known unit selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video. Subjective experiments confirm that the audiovisual signals created by our multimodal synthesis strategy are indeed perceived as more synchronous than those of systems in which the two modes are not intrinsically coherent. Furthermore, the degree of coherence between the auditory and visual modes is shown to influence the perceived quality of the synthetic visual speech fragment. In addition, the audio quality was found to have only a minor influence on the perceived quality of the visual signal.
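
The idea of unit selection over multimodal segments can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` fields (`pitch`, `mouth`), the absolute-difference join cost, the weights `w_audio` and `w_video`, and the function names are all assumptions chosen for illustration, and the target cost is reduced to a plain phone-label match. What the sketch does show is the general technique: each candidate carries the audio and video of one original recording, and a Viterbi search picks the sequence whose joins are smoothest in both modes at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical multimodal unit: audio and video from the same recording."""
    phone: str
    pitch: float  # stand-in audio feature at the segment boundary (assumption)
    mouth: float  # stand-in visual feature at the segment boundary (assumption)


def join_cost(a, b, w_audio=1.0, w_video=1.0):
    # Weighted sum of audio and visual discontinuity; weights are illustrative.
    return w_audio * abs(a.pitch - b.pitch) + w_video * abs(a.mouth - b.mouth)


def select_units(targets, database, w_audio=1.0, w_video=1.0):
    """Viterbi search for the unit sequence with the lowest total join cost."""
    # Simplified target cost: a candidate must carry the requested phone label.
    candidates = [[u for u in database if u.phone == p] for p in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for at least one target phone")

    # best[i][j] = (cost of cheapest path ending in candidates[i][j], back-pointer)
    best = [[(0.0, -1) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            row.append(min(
                (best[i - 1][k][0] + join_cost(p, u, w_audio, w_video), k)
                for k, p in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


database = [
    Unit("a", pitch=100.0, mouth=2.0),
    Unit("a", pitch=120.0, mouth=8.0),
    Unit("b", pitch=118.0, mouth=7.0),
    Unit("b", pitch=101.0, mouth=12.0),
]
path, total = select_units(["a", "b"], database)
print([(u.phone, u.pitch, u.mouth) for u in path], total)
```

Because every selected unit brings its own original video along with its audio, the two output streams stay coherent by construction. In this toy database, setting `w_video=0.0` changes which units are chosen, which hints at why ignoring the visual mode during selection can yield visible discontinuities.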