Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data

Authors:
Ning Xu;Yibing Tang;Jingyi Bao;Aiming Jiang;Xiaofeng Liu;Zhen Yang
Venue:
Speech Communication
Year:
2014

Abstract

Voice conversion (VC) is a technique that aims to map the individuality of a source speaker to that of a target speaker, and Gaussian mixture model (GMM) based methods are the prevalent approach. Despite their wide use, two major problems remain to be resolved: over-smoothing and over-fitting. The latter arises naturally when the model structure is too complex for the limited amount of training data available. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature significantly alleviates the over-fitting problem. Moreover, the GP framework makes it straightforward to perform non-linear mapping by introducing sophisticated kernel functions. This kind of method is therefore explored thoroughly in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, which makes the best use of the correlations between excitation and vocal tract features. In addition, the accuracy of the GP kernel computation can be improved with an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational cost. Objective and subjective experiments confirm the effectiveness of the proposed method and demonstrate improvements of the GP-based method over the traditional GMM-based approach.
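
To make the idea of GP-based mapping with a wider input than output more concrete, the following is a minimal sketch of GP regression in NumPy. It is not the authors' implementation: the squared-exponential kernel, the feature dimensions, the hyperparameter values, the synthetic data, and the function names (rbf_kernel, gp_fit, gp_predict) are all assumptions made for illustration. The abstract does not say how the source and target vectors are built, so here the source vector is simply taken to be wider (30 dimensions, standing in for stacked spectral and prosodic features) than the target vector (10 dimensions). The point of the sketch is that the Gram matrix is N x N for N training frames whatever the input dimensionality, so the size of the linear system to be solved does not change when the input vectors get wider.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_fit(X_src, Y_tgt, noise_var=1e-2, length_scale=1.0):
    """Return the weights alpha = (K + noise_var * I)^{-1} Y via Cholesky."""
    K = rbf_kernel(X_src, X_src, length_scale)  # N x N, independent of input width
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X_src)))
    return np.linalg.solve(L.T, np.linalg.solve(L, Y_tgt))


def gp_predict(X_new, X_src, alpha, length_scale=1.0):
    """GP posterior mean for new source frames."""
    return rbf_kernel(X_new, X_src, length_scale) @ alpha


# Synthetic stand-in data (assumption): 40 aligned frame pairs,
# 30-dimensional source vectors and 10-dimensional target vectors.
rng = np.random.default_rng(0)
X_src = rng.standard_normal((40, 30))
W = rng.standard_normal((30, 10)) / np.sqrt(30)
Y_tgt = np.tanh(X_src @ W)  # hypothetical non-linear source-to-target relation

alpha = gp_fit(X_src, Y_tgt, noise_var=1e-2, length_scale=5.0)
X_new = rng.standard_normal((5, 30))
Y_hat = gp_predict(X_new, X_src, alpha, length_scale=5.0)
print(Y_hat.shape)
```

In this sketch, widening the source vectors only affects the cost of evaluating the kernel; the Cholesky factorisation of the N x N matrix is unchanged. That is one plausible reading of "without additional computational cost", not a claim about how the paper measures it.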