In this paper, we describe a statistical approach to both articulatory-to-acoustic mapping and acoustic-to-articulatory inversion mapping that requires no phonetic information. The joint probability density of articulatory and acoustic parameters is modeled with a Gaussian mixture model (GMM) trained on a parallel acoustic-articulatory speech database. We apply GMM-based mapping under the minimum mean-square error (MMSE) criterion, originally proposed for voice conversion, to both mappings. To further improve mapping performance, we apply maximum likelihood estimation (MLE) to the GMM-based mapping method: a target parameter trajectory with appropriate static and dynamic properties is determined by imposing an explicit relationship between static and dynamic features. Experimental results demonstrate that MLE-based mapping with dynamic features significantly improves performance over MMSE-based mapping in both the articulatory-to-acoustic and inversion directions.
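
As a rough illustration of the GMM-based MMSE mapping described above, the sketch below fits a joint Gaussian mixture to concatenated articulatory-acoustic frames and maps new articulatory frames to acoustic estimates as posterior-weighted conditional means. It is a minimal sketch assuming NumPy, SciPy, and scikit-learn; the function and variable names (train_joint_gmm, mmse_map, X, Y) are illustrative rather than taken from the paper, and the MLE/dynamic-feature extension is noted after the code.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X, Y, n_mix=32):
    # Fit a full-covariance GMM to joint [articulatory; acoustic] frames.
    # X: (T, dx) articulatory frames; Y: (T, dy) acoustic frames (parallel data).
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full", random_state=0)
    gmm.fit(Z)
    return gmm

def mmse_map(gmm, X, dx):
    # MMSE estimate: E[y|x] = sum_m P(m|x) * (mu_y_m + S_yx_m S_xx_m^-1 (x - mu_x_m)).
    mu_x = gmm.means_[:, :dx]             # (M, dx) source-side mixture means
    mu_y = gmm.means_[:, dx:]             # (M, dy) target-side mixture means
    S_xx = gmm.covariances_[:, :dx, :dx]  # (M, dx, dx)
    S_yx = gmm.covariances_[:, dx:, :dx]  # (M, dy, dx)
    M = gmm.n_components
    # Mixture posteriors P(m|x) from the marginal Gaussians on the x block.
    log_p = np.stack([multivariate_normal.logpdf(X, mu_x[m], S_xx[m])
                      for m in range(M)], axis=1) + np.log(gmm.weights_)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Posterior-weighted conditional means give the MMSE mapping.
    Y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
    for m in range(M):
        A = S_yx[m] @ np.linalg.inv(S_xx[m])
        Y_hat += post[:, [m]] * (mu_y[m] + (X - mu_x[m]) @ A.T)
    return Y_hat

The MLE-based variant in the paper goes a step further: it forms per-frame conditional Gaussians over joint static-and-delta target features and then solves for the static trajectory that maximizes likelihood under the explicit static-dynamic constraint. In the standard parameter-generation formulation this reduces to the closed form y = (W' D^-1 W)^-1 W' D^-1 mu, where W is the matrix appending delta features to the static trajectory and D^-1 collects the frame-wise precisions; that step is omitted from the sketch above.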