This paper describes a speaker-independent HMM-based voice conversion technique that incorporates context-dependent prosodic symbols obtained by adaptive quantization of the fundamental frequency (F0). In the HMM-based conversion of our previous study, the input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated from the decoded information using the pre-trained target speaker's phonetically and prosodically context-dependent HMM. In that previous work, the F0 symbol was generated by quantizing the average log F0 value of each phone using the global mean and variance calculated from the training data. In the current study, these statistical parameters are instead obtained from each utterance itself, and this adaptive method improves F0 conversion performance over the conventional approach. We also introduce a speaker-independent model for decoding the input speech, and model adaptation for training the target speaker's model, in order to reduce the required amount of training data under the condition that a phonetic transcription is available for the input speech. Objective and subjective experimental results for Japanese speech demonstrate that the adaptive quantization method gives better F0 conversion performance than the conventional one. Moreover, our technique, using only ten sentences of the target speaker's adaptation data, outperforms a conventional GMM-based technique trained on 200 sentences of parallel data.
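The contrast between the conventional and adaptive schemes can be sketched as follows: each phone's average log F0 is normalized by a mean and standard deviation and mapped to a discrete prosodic symbol; using global training-data statistics corresponds to the conventional scheme, while computing the statistics from the utterance itself corresponds to the adaptive scheme described above. The number of quantization levels, the bin thresholds, and the symbol names below are illustrative assumptions, not the paper's actual settings.

```python
import math
from statistics import mean, pstdev


def quantize_f0(phone_avg_logf0, num_levels=3, stats=None):
    """Map each phone's average log F0 to a discrete prosodic symbol.

    stats: (mean, std) used for normalization. If None, the statistics
    are computed from the utterance itself (adaptive quantization);
    passing global training-data statistics instead mimics the
    conventional scheme. Levels/thresholds are illustrative only.
    """
    if stats is None:
        mu, sigma = mean(phone_avg_logf0), pstdev(phone_avg_logf0)
    else:
        mu, sigma = stats
    symbols = []
    for v in phone_avg_logf0:
        z = (v - mu) / sigma if sigma > 0 else 0.0
        # Map the z-score into num_levels equal bins spanning +/-1.5 sigma,
        # clamping values outside that range to the edge bins.
        level = min(num_levels - 1, max(0, int((z + 1.5) / 3.0 * num_levels)))
        symbols.append(f"F0_{level}")
    return symbols


# Example: average log F0 (in Hz) per phone in one utterance.
utt = [math.log(f) for f in (110, 130, 150, 125, 100)]
adaptive = quantize_f0(utt)  # statistics from this utterance
conventional = quantize_f0(utt, stats=(math.log(140), 0.25))  # global stats
```

Because the adaptive variant re-centers on each utterance, a speaker whose overall pitch differs from the training data still gets symbols that reflect the relative F0 movement within the utterance, which is the intuition behind the improved conversion performance reported above.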