Supervisory data alignment for text-independent voice conversion

Authors:
Jianhua Tao; Meng Zhang; Jani Nurminen; Jilei Tian; Xia Wang
Affiliations:
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Nokia Devices R&D, Tampere, Finland; Nokia Research Center, Beijing, China; Nokia Research Center, Beijing, China
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

We propose new supervisory data alignment methods for text-independent voice conversion that do not require parallel training corpora. Phonetic information constrains the alignment, which maps data from the source speaker onto the parameter space of the target speaker. We derive both a linear and a nonlinear method, taking alignment accuracy and topology preservation into account. For the linear alignment, we treat the phoneme clusters common to the source and target spaces as benchmarks and adapt each source data vector to the target space while maintaining the relative phonetic positions among neighboring clusters. To preserve the topological structure of the source parameter space and to improve both the stability of conversion and the accuracy of the phonetic mapping, we propose a supervised self-organizing learning algorithm with phonetic restriction that iteratively refines the alignment obtained in the previous step. Both methods can also be applied in the cross-lingual case. Evaluation results show that the proposed methods improve alignment accuracy and stability for text-independent voice conversion in both intra-lingual and cross-lingual cases.
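
The idea behind the linear alignment can be illustrated with a deliberately simplified sketch. It is not the method from the paper, whose details are not given in the abstract. It assumes that every frame carries a phoneme label, that each phoneme cluster is summarized by its centroid, and that a source frame is moved to the target space by keeping its offset from its own cluster centroid. The function names `centroids` and `align_linear` are hypothetical, and the neighborhood-cluster weighting mentioned in the abstract is omitted.

```python
from collections import defaultdict


def centroids(frames, labels):
    """Return the mean feature vector for each phoneme label."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(frames, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def align_linear(src_frames, src_labels, tgt_frames, tgt_labels):
    """Pair each source frame with a pseudo-target frame (assumed scheme).

    Only phonemes present in both speakers' data serve as benchmarks.
    A source frame keeps its offset from the source centroid of its
    phoneme and is re-anchored at the target centroid of that phoneme.
    """
    src_c = centroids(src_frames, src_labels)
    tgt_c = centroids(tgt_frames, tgt_labels)
    common = set(src_c) & set(tgt_c)

    pairs = []
    for vec, lab in zip(src_frames, src_labels):
        if lab not in common:
            continue  # no common phoneme cluster to use as a benchmark
        shifted = [v - s + t for v, s, t in zip(vec, src_c[lab], tgt_c[lab])]
        pairs.append((vec, shifted))
    return pairs
```

The resulting (source, pseudo-target) pairs could then be used to train a conventional conversion function without parallel utterances. Under the same assumptions, a nonlinear refinement step would start from these pairs and adjust them iteratively.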