Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis

Authors:
Chung-Hsien Wu;Chi-Chun Hsia;Chung-Han Lee;Mai-Chun Lin
Affiliations:
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan;Industrial Technology Research Institute-South, Hsinchu, Taiwan;Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan;Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. At each higher level, the pitch contour is encoded by discrete Legendre polynomial coefficients. The residual, i.e., the difference between the source pitch contour and the contour decoded from those coefficients, is then used for pitch modeling at the next lower level. For prosody conversion, Gaussian mixture models (GMMs) are used at the sentence and prosodic word levels. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate a set of candidate prosody conversion functions. A classification and regression tree then selects the most suitable function using linguistic and symbolic prosody features of the source speech. Three small parallel emotional speech databases, one each for happy, angry, and sad emotions, were designed and collected for training and evaluation. In objective and subjective evaluations, the hierarchical prosodic structure combined with the proposed regression-based clustering method outperformed the GMM-based prosody conversion method.
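
The hierarchical encoding described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes NumPy, substitutes a least-squares fit of ordinary Legendre polynomials sampled on [-1, 1] for the paper's discrete Legendre polynomials, and uses a synthetic contour with made-up segment boundaries and polynomial orders. The function names `encode_level`, `decompose`, and `reconstruct` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as L


def encode_level(contour, boundaries, order):
    """Fit one Legendre polynomial per segment; return coefficients and residual."""
    coeffs = []
    decoded = np.zeros_like(contour, dtype=float)
    for start, end in boundaries:
        seg = contour[start:end]
        # Map the segment's frames onto [-1, 1], the natural Legendre domain.
        x = np.linspace(-1.0, 1.0, len(seg))
        c = L.legfit(x, seg, deg=min(order, len(seg) - 1))
        coeffs.append(c)
        decoded[start:end] = L.legval(x, c)
    # The residual is what this level could not explain.
    return coeffs, contour - decoded


def decompose(contour, levels):
    """Encode from the highest level down, passing the residual to the next level."""
    residual = np.asarray(contour, dtype=float)
    encoded = []
    for boundaries, order in levels:
        coeffs, residual = encode_level(residual, boundaries, order)
        encoded.append(coeffs)
    return encoded, residual


def reconstruct(encoded, levels, residual):
    """Sum the decoded contours of all levels and the final residual."""
    total = np.array(residual, dtype=float)
    for coeffs, (boundaries, _) in zip(encoded, levels):
        for c, (start, end) in zip(coeffs, boundaries):
            x = np.linspace(-1.0, 1.0, end - start)
            total[start:end] += L.legval(x, c)
    return total


# Synthetic log-F0 contour of 60 frames (assumed data, not from the paper).
t = np.linspace(0.0, 1.0, 60)
f0 = 5.0 - 0.3 * t + 0.1 * np.sin(2 * np.pi * 2 * t) + 0.03 * np.sin(2 * np.pi * 6 * t)

# Assumed segmentation: (segment boundaries, polynomial order) per level.
levels = [
    ([(0, 60)], 3),                                   # sentence
    ([(0, 30), (30, 60)], 3),                         # prosodic words
    ([(i, i + 10) for i in range(0, 60, 10)], 3),     # subsyllables
]

encoded, residual = decompose(f0, levels)
print("residual energy after all levels:", float(np.sum(residual ** 2)))
```

In this sketch each level models only what the levels above it left unexplained, so a conversion function applied to one level's coefficients changes the contour at that time scale alone. Summing the decoded contours and the final residual recovers the original contour.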