Generating expressive synthetic voices requires carefully designed databases containing a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques that reduce database collection and processing effort while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody, as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open-source, modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS output, transforming neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show a tradeoff between identification and naturalness: combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings, while the fine detail of both voice quality and prosody, as preserved by copy synthesis, contributed to better identification than the approximate models.
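The GMM-based prediction mentioned above is typically realized as a joint-density Gaussian mixture regression: a GMM is fitted on stacked source/target feature vectors, and a converted frame is the posterior-weighted conditional mean E[y | x]. The sketch below illustrates this general technique only; it is not the paper's implementation, and the function names, feature dimensions, and component counts are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit a full-covariance GMM on stacked joint vectors z = [x; y]."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(Z)
    return gmm


def convert(gmm, X, dim_x):
    """Map source frames X to the target space via the conditional mean
    E[y | x] under the joint GMM (classic joint-density regression)."""
    n_out = gmm.means_.shape[1] - dim_x
    # Per-component blocks of the joint mean and covariance
    mu_x = gmm.means_[:, :dim_x]
    mu_y = gmm.means_[:, dim_x:]
    S_xx = gmm.covariances_[:, :dim_x, :dim_x]
    S_yx = gmm.covariances_[:, dim_x:, :dim_x]

    Y_hat = np.zeros((len(X), n_out))
    for i, x in enumerate(X):
        # Posterior responsibility p(m | x) from the marginal GMM over x
        lik = np.array([gmm.weights_[m]
                        * multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                        for m in range(gmm.n_components)])
        post = lik / lik.sum()
        for m in range(gmm.n_components):
            # Conditional mean of y given x for component m
            cond = mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
            Y_hat[i] += post[m] * cond
    return Y_hat
```

In practice X and Y would hold time-aligned spectral features (e.g. MFCCs or LSFs) of neutral and expressive speech; here any paired vectors work. Such frame-wise regression is known to over-smooth spectral detail, which is consistent with the abstract's finding that copy resynthesis, which preserves fine detail, identifies the target style better than the approximate model.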