Quality-enhanced voice morphing using maximum likelihood transformations

Authors: Hui Ye; S. Young
Affiliations: Eng. Dept., Cambridge Univ.; -
Venue: IEEE Transactions on Audio, Speech, and Language Processing
Year: 2006

Abstract

Voice morphing is a technique for modifying a source speaker's speech so that it sounds as if it were spoken by a designated target speaker. The core process in a voice morphing system is the transformation of the source speaker's spectral envelope to match that of the target speaker; linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, naively applying envelope transformation together with the necessary pitch and duration modifications results in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates two specific issues: the reliance on parallel training data and the artifacts described above. First, a general maximum likelihood framework is proposed for transform estimation, which avoids the need for parallel training data inherent in conventional least mean square approaches. Second, the main causes of artifacts are identified as glottal coupling, unnatural phase dispersion, and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate them. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches can effectively transform speaker identity while maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.
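
The conventional baseline mentioned in the abstract, a linear transform estimated by least squares from time-aligned parallel frames, can be illustrated with a minimal sketch. This is not the maximum likelihood method proposed in the paper. It is a simplified illustration that assumes a diagonal transform (one scale and one offset per feature dimension), and the function names `fit_diagonal_affine` and `apply_transform` are hypothetical.

```python
# Simplified illustration (an assumption, not the paper's method):
# per-dimension least-squares fit of target[d] ~ a[d] * source[d] + b[d]
# from time-aligned parallel frames.

def fit_diagonal_affine(source_frames, target_frames):
    """Return (scales, offsets) minimising squared error per dimension."""
    if not source_frames or len(source_frames) != len(target_frames):
        raise ValueError("need equally many time-aligned source and target frames")
    n = len(source_frames)
    dims = len(source_frames[0])
    scales, offsets = [], []
    for d in range(dims):
        xs = [frame[d] for frame in source_frames]
        ys = [frame[d] for frame in target_frames]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Fall back to unit scale if the source dimension is constant.
        a = cov_xy / var_x if var_x > 0 else 1.0
        scales.append(a)
        offsets.append(mean_y - a * mean_x)
    return scales, offsets


def apply_transform(frame, scales, offsets):
    """Map one source feature frame towards the target speaker's space."""
    return [a * x + b for x, a, b in zip(frame, scales, offsets)]
```

The fit depends on each source frame being paired with the matching target frame, which is why this family of estimators needs time-aligned parallel recordings, the requirement the maximum likelihood framework is said to avoid.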