A dynamic cost weighting framework for unit selection text-to-speech synthesis

Authors:
Jerome R. Bellegarda
Affiliations:
Speech and Language Technologies, Apple, Inc., Cupertino, CA
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. Constraints are normally invoked on diverse characteristics such as inter-unit discontinuity, overall pitch contour, local duration profile, etc., leading to costs often too heterogeneous for a direct quantitative comparison. In order to rank available candidate units, this complexity must be reduced to a single number, and the relative importance of each information stream becomes highly critical. Yet this influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are then applied indiscriminately to broad classes of concatenations. This paper proposes an alternative approach, dynamic cost weighting, based on a data-driven framework separately optimized for each concatenation considered. Specifically, the cost distribution in every stream is dynamically leveraged on a per-concatenation basis to locally shift weight towards those characteristics that offer high discrimination between candidate units, and away from those characteristics that are intrinsically less discriminative. An illustrative case study demonstrates the potential benefits of this solution, and listening evidence suggests that it does indeed entail higher perceived TTS quality.
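
The general idea of per-concatenation weighting can be illustrated with a minimal sketch. It is not the optimization framework of the paper: it simply assumes that the stream costs have already been brought to a comparable scale, and it uses the standard deviation of each stream across the candidates as a stand-in measure of discrimination. The function names `dynamic_weights` and `rank_candidates`, and the example cost values, are hypothetical.

```python
import numpy as np


def dynamic_weights(costs, eps=1e-12):
    """Illustrative per-concatenation weights (an assumption, not the paper's method).

    costs: array of shape (n_candidates, n_streams), assumed to be already
    normalized to a comparable scale across streams.
    Returns non-negative weights that sum to 1, proportional to the spread
    of each stream across the candidates at this concatenation.
    """
    costs = np.asarray(costs, dtype=float)
    n_streams = costs.shape[1]
    # Spread of each stream over the candidates: a simple proxy for how well
    # that stream discriminates between candidates at this concatenation.
    spread = costs.std(axis=0)
    total = spread.sum()
    if total <= eps:
        # No stream discriminates at all: fall back to uniform weights.
        return np.full(n_streams, 1.0 / n_streams)
    return spread / total


def rank_candidates(costs):
    """Reduce each candidate's stream costs to one number and rank by it."""
    costs = np.asarray(costs, dtype=float)
    weights = dynamic_weights(costs)
    combined = costs @ weights  # one weighted cost per candidate
    return np.argsort(combined), weights


if __name__ == "__main__":
    # Three candidates, two streams. Stream 0 barely varies between
    # candidates, stream 1 varies strongly (hypothetical values).
    example = [[0.50, 0.10],
               [0.52, 0.90],
               [0.51, 0.40]]
    order, weights = rank_candidates(example)
    print("weights:", weights)
    print("ranking (best first):", order)
```

In this sketch the weights are recomputed for every concatenation, so a stream that separates the candidates well at one point in the utterance may receive little weight at another point where all candidates score alike on it.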