Exploiting prosody hierarchy and dynamic features for pitch modeling and generation in HMM-based speech synthesis

Authors:
Chi-Chun Hsia;Chung-Hsien Wu;Jung-Yun Wu
Affiliations:
ICT-Enabled Healthcare Program, Industrial Technology Research Institute--South, Tainan, Taiwan;Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan;Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

This paper proposes a method for modeling and generating pitch in hidden Markov model (HMM)-based Mandarin speech synthesis by exploiting prosody hierarchy and dynamic pitch features. The prosodic structure of a sentence is represented by a prosody hierarchy, which is constructed from the predicted prosodic breaks using a supervised classification and regression tree (S-CART). The S-CART is trained by maximizing the proportional reduction of entropy to minimize errors in the prediction of prosodic breaks. The pitch contour of a spoken sentence is estimated using the STRAIGHT algorithm and decomposed into static prosodic features at the prosodic word, syllable, and frame layers, based on the predicted prosodic structure. Dynamic features at each layer are estimated to preserve the temporal correlation between adjacent units. A hierarchical prosody model is constructed using an unsupervised CART (U-CART) for generating the pitch contour. The minimum description length (MDL) criterion is adopted in U-CART training. Objective and subjective evaluations with statistical hypothesis testing were conducted, and the results were compared with the corresponding results for HMM-based pitch modeling. The comparison confirms the improved performance of the proposed method.
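
The abstract states that the S-CART is trained by maximizing the proportional reduction of entropy. As a rough illustration of what such a split criterion might look like, the sketch below scores a candidate binary split by the fraction of the parent node's Shannon entropy that the split removes. The function names, the binary-split assumption, and the break labels "B0" and "B1" are hypothetical and are not taken from the paper; the authors' actual training procedure may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def proportional_entropy_reduction(parent, left, right):
    """Fraction of the parent's entropy removed by a binary split.

    Assumption: left and right together partition parent.
    Returns a value between 0 (uninformative split) and 1 (pure children).
    """
    h_parent = entropy(parent)
    if h_parent == 0.0:
        return 0.0  # parent is already pure; nothing left to reduce
    n = len(parent)
    h_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return (h_parent - h_children) / h_parent


# Hypothetical prosodic break labels: "B0" = no break, "B1" = break.
labels = ["B0", "B0", "B1", "B1"]

# A split that separates the two classes completely.
perfect = proportional_entropy_reduction(labels, ["B0", "B0"], ["B1", "B1"])

# A split whose children have the same class mix as the parent.
useless = proportional_entropy_reduction(labels, ["B0", "B1"], ["B0", "B1"])
```

Under these assumptions, a tree learner would evaluate this score for each candidate question at a node and keep the question with the highest value. Normalizing by the parent entropy makes scores comparable across nodes with different class balances.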