An Analysis of HMM-based prediction of articulatory movements

  • Authors:
  • Zhen-Hua Ling; Korin Richmond; Junichi Yamagishi

  • Affiliations:
  • iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, Anhui 230027, PR China; The Centre for Speech Technology Research (CSTR), University of Edinburgh, Edinburgh EH8 9LW, United Kingdom; The Centre for Speech Technology Research (CSTR), University of Edinburgh, Edinburgh EH8 9LW, United Kingdom

  • Venue:
  • Speech Communication
  • Year:
  • 2010

Abstract

This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMMs). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-to-articulatory inversion mapping problem, where only acoustic input is available. An average RMS error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
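
To make the abstract's generation step and evaluation measures concrete, the sketch below illustrates MLPG for a single EMA channel together with the RMS error and correlation coefficient used to report results. This is a minimal, hypothetical NumPy implementation assuming diagonal covariances and a simple three-point delta window; the function names, window coefficients, and toy statistics are illustrative and are not taken from the paper.

```python
import numpy as np

def mlpg_1d(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for one EMA channel.

    mu, var : (T, 2) arrays of per-frame means and variances of the static
    and delta features, taken from the selected HMM state sequence.
    Solves (W' S^-1 W) c = W' S^-1 mu for the smooth static trajectory c.
    """
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row: identity
        for k, w in enumerate(delta_win):        # delta row: finite difference
            tau = min(max(t + k - 1, 0), T - 1)  # clamp at trajectory edges
            W[2 * t + 1, tau] += w
    m = mu.reshape(-1)                           # interleaved static/delta means
    p = 1.0 / var.reshape(-1)                    # diagonal precisions
    WtP = W.T * p                                # W' Sigma^{-1}
    return np.linalg.solve(WtP @ W, WtP @ m)     # ML static trajectory

def rms_and_corr(pred, ref):
    """RMS error (in mm, if inputs are in mm) and Pearson correlation."""
    rmse = np.sqrt(np.mean((pred - ref) ** 2))
    corr = np.corrcoef(pred, ref)[0, 1]
    return rmse, corr

# Toy usage: 100 frames of hypothetical state-level statistics.
T = 100
mu = np.column_stack([np.sin(np.linspace(0, 3, T)), np.zeros(T)])
var = np.full((T, 2), 0.1)
traj = mlpg_1d(mu, var)
print(rms_and_corr(traj, mu[:, 0]))
```

In the full system the same construction would apply jointly across all EMA channels (and, for the unified acoustic-articulatory models, alongside the acoustic stream), but the single-channel case above conveys the idea.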