IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
This paper presents an investigation into ways of integrating articulatory features into hidden Markov model (HMM)-based parametric speech synthesis. In broad terms, this is achieved by estimating the joint distribution of acoustic and articulatory features during training, which may then be used in conjunction with a maximum-likelihood criterion to produce acoustic synthesis parameters for generating speech. Within this broad approach, we explore several possible variations in the construction of an HMM-based synthesis system that allow articulatory features to influence acoustic modeling: model clustering, state synchrony, and cross-stream feature dependency. Performance is evaluated using the RMS error of generated acoustic parameters as well as formal listening tests. Our results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. Most significantly, however, our experiments demonstrate that modeling the dependency between these two feature streams makes speech synthesis systems more flexible: the characteristics of synthetic speech can be controlled simply by modifying the generated articulatory features as part of the process of producing acoustic synthesis parameters.
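The control mechanism described above, in which modifying articulatory features changes the generated acoustic parameters, can be illustrated with a minimal sketch. This is not the paper's implementation; it simply shows the underlying idea of conditioning one feature stream on another in a joint Gaussian state model, with all numerical values below being hypothetical state statistics.

```python
import numpy as np

# Hypothetical per-state statistics for a 2-D acoustic stream (x)
# and a 2-D articulatory stream (y), modeled jointly as a Gaussian.
mu_x = np.array([1.0, 0.5])        # acoustic mean
mu_y = np.array([0.2, -0.3])       # articulatory mean
Sigma_yy = np.eye(2) * 0.2         # articulatory covariance
Sigma_xy = np.array([[0.1, 0.0],   # cross-stream covariance: this term
                     [0.0, 0.1]])  # encodes the feature dependency

def conditional_acoustic_mean(y):
    """Mean of p(x | y) for a joint Gaussian:
    mu_x + Sigma_xy @ Sigma_yy^-1 @ (y - mu_y)."""
    return mu_x + Sigma_xy @ np.linalg.solve(Sigma_yy, y - mu_y)

baseline = conditional_acoustic_mean(mu_y)                      # equals mu_x
modified = conditional_acoustic_mean(mu_y + np.array([0.4, 0.0]))
print(baseline)   # [1.  0.5]
print(modified)   # [1.2 0.5] — the articulatory change shifts the acoustic mean
```

With a zero cross-stream covariance the articulatory modification would have no effect, which is why modeling the dependency between the two streams is what makes this kind of control possible.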