Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis

  • Authors:
  • Wesley Mattheyses, Lukas Latacz, Werner Verhelst, Hichem Sahli

  • Affiliations:
  • Dept. ETRO, Vrije Universiteit Brussel, B-1050 Brussels, Belgium (all authors)

  • Venue:
  • MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2008

Abstract

Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Recently, much interest has been directed toward data-driven 2D photorealistic synthesis, in which the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from a single audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that displays a high level of audiovisual correlation, which is crucial for a natural perception of the synthetic speech signal.
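The core idea of extending unit selection to multimodal segments can be illustrated with a toy sketch. The code below is not the authors' implementation: the unit database, feature values, and cost weights are all hypothetical, and real systems use rich acoustic and visual feature vectors rather than scalars. It shows the essential point that each candidate unit keeps its original audio and video together, and that the concatenation (join) cost penalizes discontinuities in both modalities, so the Viterbi search favors sequences that are smooth audiovisually.

```python
# Toy multimodal unit selection sketch (illustrative, not the paper's system).
# Each unit is a recorded audiovisual segment; its audio and visual features
# stay paired, preserving the original audiovisual correlation.
units = [
    {"phone": "h", "audio": 0.2, "visual": 0.1},
    {"phone": "h", "audio": 0.8, "visual": 0.7},
    {"phone": "e", "audio": 0.3, "visual": 0.2},
    {"phone": "e", "audio": 0.9, "visual": 0.9},
    {"phone": "l", "audio": 0.4, "visual": 0.3},
    {"phone": "o", "audio": 0.5, "visual": 0.4},
]

def join_cost(a, b, w_audio=0.5, w_visual=0.5):
    """Concatenation cost over both modalities: a mismatch in either the
    acoustic or the visual features at the join penalizes the sequence."""
    return (w_audio * abs(a["audio"] - b["audio"])
            + w_visual * abs(a["visual"] - b["visual"]))

def select_units(target_phones):
    """Viterbi search over candidate audiovisual units for the target
    phone sequence, minimizing the total join cost."""
    # Candidates per target position (the target cost here is reduced
    # to a simple phone-label match for clarity).
    cands = [[u for u in units if u["phone"] == p] for p in target_phones]
    assert all(cands), "every target phone needs at least one candidate"
    # best[i][j] = (accumulated cost, backpointer) for candidate j at step i
    best = [[(0.0, None)] * len(cands[0])]
    for i in range(1, len(cands)):
        row = []
        for u in cands[i]:
            row.append(min((best[i - 1][k][0] + join_cost(prev, u), k)
                           for k, prev in enumerate(cands[i - 1])))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(cands) - 1, -1, -1):
        path.append(cands[i][j])
        if i > 0:
            j = best[i][j][1]
    return list(reversed(path))

seq = select_units(["h", "e", "l", "o"])
print([u["phone"] for u in seq])
```

Because audio and video are selected jointly rather than generated by two separate pipelines, the lip motion in the output comes from the same recorded segments as the audio it accompanies, which is the audiovisual correlation the paper argues is crucial for naturalness.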