Learning Prosodic Patterns for Mandarin Speech Synthesis

  • Authors:
  • Yiqiang Chen;Wen Gao;Tingshao Zhu;Charles Ling

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, People's Republic of China 100080. yqchen@ict.ac.cn;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, People's Republic of China 100080. wgao@ict.ac.cn;Department of Computing Science, University of Alberta, Edmonton, Canada T6G 2E1. tszhu@cs.ualberta.ca;Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7. ling@csd.uwo.ca

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2002

Abstract

Widespread use of text-to-speech (TTS) technology requires higher-quality synthesized speech. Prosody, which mainly describes the variation of pitch, is the key factor: poorly modeled prosodic patterns make synthetic speech sound unnatural and monotonous. The prosodic rules used in most Chinese TTS systems are constructed by hand by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from large databases of actual Mandarin speech, in order to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by clustering analysis. Several machine learning techniques, including Rough Sets, Artificial Neural Networks (ANN), and Decision Trees, are then trained to model fundamental frequency and energy contours, which can be used directly in a pitch-synchronous-overlap-add-based (PSOLA-based) TTS system. The experimental results show that the synthesized prosodic features closely resemble their original counterparts for most syllables.
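
As a rough illustration of the clustering step, the sketch below groups syllable pitch contours into a small set of typical patterns. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the helper names (`resample`, `kmeans`), the fixed contour length of 5 points, and the toy F0 values in Hz are all hypothetical.

```python
# Hypothetical sketch: cluster length-normalized F0 contours with plain k-means.
# The choice of k-means, the helper names, and the toy data are assumptions.

def resample(contour, n_points):
    """Linearly interpolate a pitch contour to n_points samples (n_points >= 2)."""
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, n_iter=20):
    """Plain k-means with a deterministic start: the first k vectors are the initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each contour joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return centroids, labels

# Toy F0 contours in Hz (made-up values): two rising, two falling.
contours = [
    [200, 220, 240, 260],
    [260, 240, 220, 200],
    [205, 230, 255],
    [250, 225, 200, 190, 180],
]
vectors = [resample(c, 5) for c in contours]
centroids, labels = kmeans(vectors, k=2)
print(labels)
```

Each resulting centroid can be read as one "typical" contour shape. In a pipeline like the one the abstract describes, a classifier such as a decision tree could then be trained to predict the cluster label of a syllable from its linguistic context, though the features used for that step are not given here.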