Voice conversion (VC) is a technique that aims to map the individuality of a source speaker onto that of a target speaker, and Gaussian mixture model (GMM) based methods are evidently the most prevalent. Despite their wide use, two major problems remain to be resolved: over-smoothing and over-fitting. The latter arises naturally when the model structure is too complex given a limited amount of training data. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature significantly alleviates the over-fitting problem. Moreover, the GP framework makes it straightforward to perform non-linear mapping by introducing sophisticated kernel functions. This kind of method therefore deserves the thorough exploration given in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the intercorrelations embedded among both excitation and vocal tract features. In addition, the accuracy of the GP kernel computation can be improved by an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational cost. Experiments confirm the effectiveness of the proposed method both objectively and subjectively, demonstrating that the GP-based method improves upon the traditional GMM-based approach.
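As a minimal sketch of the underlying idea (not the paper's implementation), GP regression can act as a frame-wise spectral mapper between aligned source and target features. All names, dimensions, and the kernel choice below are illustrative assumptions, shown with scikit-learn:

```python
# Illustrative sketch: Gaussian-process regression as a frame-wise
# spectral mapping for voice conversion. Data, dimensions, and kernel
# hyperparameters are hypothetical toy values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Toy parallel training data: time-aligned source/target spectral frames.
# Note the asymmetry: input dimensionality exceeds output dimensionality.
n_frames, src_dim, tgt_dim = 200, 24, 20
X_src = rng.standard_normal((n_frames, src_dim))
# Hypothetical target frames: a nonlinear function of the source plus noise.
W = rng.standard_normal((src_dim, tgt_dim)) / np.sqrt(src_dim)
Y_tgt = np.tanh(X_src @ W) + 0.05 * rng.standard_normal((n_frames, tgt_dim))

# An RBF kernel provides the non-linear mapping mentioned in the abstract;
# the white-noise term acts as regularization against over-fitting.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=5.0) + WhiteKernel(noise_level=1e-2),
    normalize_y=True,
)
gp.fit(X_src, Y_tgt)

# Convert unseen source frames into estimated target frames.
X_new = rng.standard_normal((10, src_dim))
Y_hat = gp.predict(X_new)
print(Y_hat.shape)  # one converted frame per input frame
```

Because the GP is nonparametric, its capacity grows with the training set rather than being fixed in advance, which is the property the abstract credits with alleviating over-fitting relative to a GMM of fixed component count.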