Selecting non-uniform units from a very large corpus for concatenative speech synthesizer

  • Authors:
  • Min Chu; Hu Peng; Hong-yun Yang; E. Chang

  • Affiliations:
  • Microsoft Research China, Beijing, China

  • Venue:
  • ICASSP '01: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 2
  • Year:
  • 2001


Abstract

This paper proposes a two-module text-to-speech (TTS) system structure that bypasses the prosody model predicting numerical prosodic parameters for synthetic speech. Instead, the many instances of each basic unit in a large speech corpus are classified into categories by a classification and regression tree (CART), in which the expected weighted sum of squared regression errors of prosodic features is used as the splitting criterion. Better prosody is achieved by keeping a slight diversity in the prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented that makes the best unit-selection decision by minimizing the concatenation cost over a whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. Very natural and fluent speech is synthesized, according to informal listening tests.
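
The abstract describes selecting one corpus instance per unit position so that the summed cost over the whole utterance is minimal, which is typically solved with a Viterbi-style dynamic-programming search. The sketch below is a minimal illustration of that idea only, not the authors' implementation: the `Instance` fields, the `target_cost` and `concat_cost` functions, and their weights are illustrative assumptions, and the multi-tier handling of larger non-uniform units is omitted.

```python
# Minimal sketch of unit selection by minimizing the total cost of an utterance.
# Cost functions and data layout are illustrative assumptions, not the paper's.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Instance:
    """One recorded instance of a basic unit (e.g. a syllable) in the corpus."""
    unit: str        # unit label
    f0: float        # mean F0 of this instance (Hz)
    duration: float  # duration (s)
    energy: float    # mean energy (arbitrary units)

def target_cost(inst: Instance, target_f0: float, target_dur: float) -> float:
    """Hypothetical distance between an instance and the desired prosodic target."""
    return abs(inst.f0 - target_f0) / 50.0 + abs(inst.duration - target_dur) / 0.05

def concat_cost(left: Instance, right: Instance) -> float:
    """Hypothetical mismatch penalty at the join between two adjacent instances."""
    return abs(left.f0 - right.f0) / 50.0 + abs(left.energy - right.energy)

def select_units(candidates: List[List[Instance]],
                 targets: List[Tuple[float, float]]) -> List[Instance]:
    """Viterbi-style search: pick one candidate per position so that the sum of
    target and concatenation costs over the whole utterance is minimal."""
    n = len(candidates)
    # best[i][j]: minimal cost of a path ending at candidate j of position i
    best = [[target_cost(c, *targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for cand in candidates[i]:
            tc = target_cost(cand, *targets[i])
            prev = [best[i - 1][k] + concat_cost(candidates[i - 1][k], cand)
                    for k in range(len(candidates[i - 1]))]
            k_min = min(range(len(prev)), key=prev.__getitem__)
            row_cost.append(prev[k_min] + tc)
            row_back.append(k_min)
        best.append(row_cost)
        back.append(row_back)
    # Trace back the cheapest path through the candidate lattice.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The dynamic-programming formulation keeps the search cost linear in utterance length (times the square of the candidate-list size per position) while still guaranteeing the globally cheapest concatenation for the whole utterance, rather than greedily choosing each unit in isolation.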