Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis

  • Authors:
  • Wesley Mattheyses, Lukas Latacz, Werner Verhelst, Hichem Sahli

  • Affiliations:
  • Dept. ETRO, Vrije Universiteit Brussel, B-1050 Brussels, Belgium (all authors)

  • Venue:
  • MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2008

Abstract

Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Recently, much interest has been directed toward data-driven 2D photorealistic synthesis, in which the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from a single audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that displays a high level of audiovisual correlation, which is crucial for a natural perception of the synthetic speech signal.
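The core idea of extending unit selection to multimodal segments can be illustrated with a toy sketch. The code below is not the authors' implementation: the unit database, feature values, and cost weights are all hypothetical, and real systems use rich acoustic and visual feature vectors rather than scalars. It shows the essential point that each candidate unit keeps its original audio and video together, and that the concatenation (join) cost penalizes discontinuities in both modalities, so the Viterbi search favors sequences that are smooth audiovisually.

```python
# Toy multimodal unit selection sketch (illustrative, not the paper's system).
# Each unit is a recorded audiovisual segment; its audio and visual features
# stay paired, preserving the original audiovisual correlation.
units = [
    {"phone": "h", "audio": 0.2, "visual": 0.1},
    {"phone": "h", "audio": 0.8, "visual": 0.7},
    {"phone": "e", "audio": 0.3, "visual": 0.2},
    {"phone": "e", "audio": 0.9, "visual": 0.9},
    {"phone": "l", "audio": 0.4, "visual": 0.3},
    {"phone": "o", "audio": 0.5, "visual": 0.4},
]

def join_cost(a, b, w_audio=0.5, w_visual=0.5):
    """Concatenation cost over both modalities: a mismatch in either the
    acoustic or the visual features at the join penalizes the sequence."""
    return (w_audio * abs(a["audio"] - b["audio"])
            + w_visual * abs(a["visual"] - b["visual"]))

def select_units(target_phones):
    """Viterbi search over candidate audiovisual units for the target
    phone sequence, minimizing the total join cost."""
    # Candidates per target position (the target cost here is reduced
    # to a simple phone-label match for clarity).
    cands = [[u for u in units if u["phone"] == p] for p in target_phones]
    assert all(cands), "every target phone needs at least one candidate"
    # best[i][j] = (accumulated cost, backpointer) for candidate j at step i
    best = [[(0.0, None)] * len(cands[0])]
    for i in range(1, len(cands)):
        row = []
        for u in cands[i]:
            row.append(min((best[i - 1][k][0] + join_cost(prev, u), k)
                           for k, prev in enumerate(cands[i - 1])))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(cands) - 1, -1, -1):
        path.append(cands[i][j])
        if i > 0:
            j = best[i][j][1]
    return list(reversed(path))

seq = select_units(["h", "e", "l", "o"])
print([u["phone"] for u in seq])
```

Because audio and video are selected jointly rather than generated by two separate pipelines, the lip motion in the output comes from the same recorded segments as the audio it accompanies, which is the audiovisual correlation the paper argues is crucial for naturalness.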