Selecting non-uniform units from a very large corpus for concatenative speech synthesizer

  • Authors:
  • Min Chu; Hu Peng; Hong-yun Yang; E. Chang

  • Affiliations:
  • Microsoft Research China, Beijing, China

  • Venue:
  • ICASSP '01: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 2
  • Year:
  • 2001


Abstract

This paper proposes a two-module text-to-speech (TTS) system structure that bypasses the prosody model predicting numerical prosodic parameters for synthetic speech. Instead, the many instances of each basic unit in a large speech corpus are classified into categories by a classification and regression tree (CART), in which the expected weighted sum of squared regression errors of prosodic features is used as the splitting criterion. Better prosody is achieved by keeping a slight diversity in the prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented that makes the best unit-selection decision by minimizing the concatenation cost over a whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. Very natural and fluent speech is synthesized, according to informal listening tests.
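
The abstract describes selecting one corpus instance per unit position so that the summed cost over the whole utterance is minimal, which is typically solved with a Viterbi-style dynamic-programming search. The sketch below is a minimal illustration of that idea only, not the authors' implementation: the `Instance` fields, the `target_cost` and `concat_cost` functions, and their weights are illustrative assumptions, and the multi-tier handling of larger non-uniform units is omitted.

```python
# Minimal sketch of unit selection by minimizing the total cost of an utterance.
# Cost functions and data layout are illustrative assumptions, not the paper's.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Instance:
    """One recorded instance of a basic unit (e.g. a syllable) in the corpus."""
    unit: str        # unit label
    f0: float        # mean F0 of this instance (Hz)
    duration: float  # duration (s)
    energy: float    # mean energy (arbitrary units)

def target_cost(inst: Instance, target_f0: float, target_dur: float) -> float:
    """Hypothetical distance between an instance and the desired prosodic target."""
    return abs(inst.f0 - target_f0) / 50.0 + abs(inst.duration - target_dur) / 0.05

def concat_cost(left: Instance, right: Instance) -> float:
    """Hypothetical mismatch penalty at the join between two adjacent instances."""
    return abs(left.f0 - right.f0) / 50.0 + abs(left.energy - right.energy)

def select_units(candidates: List[List[Instance]],
                 targets: List[Tuple[float, float]]) -> List[Instance]:
    """Viterbi-style search: pick one candidate per position so that the sum of
    target and concatenation costs over the whole utterance is minimal."""
    n = len(candidates)
    # best[i][j]: minimal cost of a path ending at candidate j of position i
    best = [[target_cost(c, *targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for cand in candidates[i]:
            tc = target_cost(cand, *targets[i])
            prev = [best[i - 1][k] + concat_cost(candidates[i - 1][k], cand)
                    for k in range(len(candidates[i - 1]))]
            k_min = min(range(len(prev)), key=prev.__getitem__)
            row_cost.append(prev[k_min] + tc)
            row_back.append(k_min)
        best.append(row_cost)
        back.append(row_back)
    # Trace back the cheapest path through the candidate lattice.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The dynamic-programming formulation keeps the search cost linear in utterance length (times the square of the candidate-list size per position) while still guaranteeing the globally cheapest concatenation for the whole utterance, rather than greedily choosing each unit in isolation.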