Synthesis of Spontaneous Speech With Syllable Contraction Using State-Based Context-Dependent Voice Transformation

  • Authors:
  • Chung-Hsien Wu;Yi-Chin Huang;Chung-Han Lee;Jun-Cheng Guo

  • Affiliations:
Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan, Taiwan (all authors)

  • Venue:
  • IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
  • Year:
  • 2014


Abstract

Pronunciation varies naturally in spontaneous speech and is an integral aspect of spontaneous expression. This study describes a voice transformation-based approach to generating spontaneous speech with syllable contractions for Hidden Markov Model (HMM)-based speech synthesis. A multi-dimensional linear regression model is adopted as the context-dependent, state-based transformation function to convert the feature sequence of read speech to that of spontaneous speech with syllable contraction. Because the amount of training data is limited, the estimated transformation functions are clustered using a decision tree based on linguistic and articulatory features, enabling better and more efficient selection of a suitable transformation function. Furthermore, to cope with the small parallel corpus, cross-validation of the trained transformation functions is performed to ensure that correct transformation functions are obtained and to prevent over-fitting. Consequently, pronunciation variations of syllable contraction, for both trained and unseen syllable-contracted words, are generated from the transformation function retrieved from the decision tree using linguistic and articulatory features. Objective and subjective tests were used to evaluate the performance of the proposed approach. Evaluation results demonstrate that the proposed transformation function substantially improves the apparent spontaneity of the synthesized speech compared with conventional methods.
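The core operation the abstract describes, a state-based linear regression transform mapping read-speech features to spontaneous-speech features, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits an affine map y = Wx + b by least squares from parallel feature pairs (standing in for the per-state, context-dependent functions the paper trains), using NumPy; the function names and toy data are invented for the example.

```python
import numpy as np

def fit_linear_transform(X_read, Y_spon):
    """Least-squares fit of W, b so that Y_spon ~= X_read @ W.T + b.

    Stands in for estimating one state-based transformation function
    from a parallel read/spontaneous feature corpus.
    """
    # Augment each feature vector with a bias term: [x, 1]
    X_aug = np.hstack([X_read, np.ones((X_read.shape[0], 1))])
    # Multi-output least squares over all target dimensions at once
    coef, *_ = np.linalg.lstsq(X_aug, Y_spon, rcond=None)
    W, b = coef[:-1].T, coef[-1]
    return W, b

def apply_transform(W, b, X):
    """Convert read-speech feature vectors to the spontaneous domain."""
    return X @ W.T + b

# Toy parallel corpus: here the "spontaneous" features really are an
# affine map of the "read" ones, so the fit should recover it exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # read-speech feature vectors
W_true = rng.normal(size=(5, 5))
b_true = rng.normal(size=5)
Y = X @ W_true.T + b_true            # parallel spontaneous features

W, b = fit_linear_transform(X, Y)
print(np.allclose(apply_transform(W, b, X), Y, atol=1e-6))  # True
```

In the paper's setting, one such function would be trained per HMM state and context, and the decision tree over linguistic and articulatory features would select which fitted (W, b) to apply to an unseen syllable-contracted word.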