The basic goal of a voice conversion system is to modify the speaker-specific characteristics of a speech signal while keeping the message and the environmental information intact. Speaker characteristics are reflected in speech at different levels: the shape of the glottal pulse (excitation source characteristics), the shape of the vocal tract (vocal tract system characteristics), and long-term features (suprasegmental or prosodic characteristics). In this paper, we propose neural network models for developing mapping functions at each of these levels. The features used for developing the mapping functions are extracted using pitch synchronous analysis, which provides accurate estimates of the vocal tract parameters by analyzing the speech signal independently in each pitch period, without influence from adjacent pitch cycles. In this work, the instants of significant excitation are used as pitch markers to perform the pitch synchronous analysis. The instants of significant excitation correspond to the instants of glottal closure (epochs) in voiced speech, and to random excitations such as the onset of a burst in nonvoiced speech. They are computed from the linear prediction (LP) residual of the speech signal using the average group-delay property of minimum phase signals. Line spectral frequencies (LSFs) are used to represent the vocal tract characteristics and to develop the associated mapping function. The LP residual of the speech signal is viewed as the excitation source, and the residual samples around the instant of glottal closure are used for mapping. Prosodic parameters at the syllable and phrase levels are used for deriving the prosodic mapping function. The source and system level mapping functions are derived pitch synchronously, and the target prosodic parameters are also incorporated pitch synchronously using the instants of significant excitation.
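The pitch synchronous LP/LSF front end described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes numpy, uses the standard autocorrelation method for LP analysis, and converts the LP polynomial to LSFs via the symmetric/antisymmetric polynomials P(z) and Q(z). The epoch locations in the demo are hypothetical pitch markers; in the paper they would come from the group-delay based epoch extractor.

```python
import numpy as np

def lp_coefficients(frame, order=10):
    """Autocorrelation-method LP analysis of one pitch-synchronous frame.
    Returns A(z) = [1, -a1, ..., -ap]."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations R a = r[1:p+1], lightly regularized
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def lp_to_lsf(a):
    """LP polynomial -> line spectral frequencies (radians in (0, pi)).
    P(z) = A(z) + z^-(p+1) A(z^-1),  Q(z) = A(z) - z^-(p+1) A(z^-1);
    the LSFs are the angles of their unit-circle roots, interleaved."""
    a_rev = a[::-1]
    P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a_rev))
    Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a_rev))
    lsf = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one of each conjugate pair; drop the trivial roots at z = +/-1
        lsf.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return np.sort(np.array(lsf))

if __name__ == "__main__":
    fs = 8000
    n = np.arange(fs)
    # toy "voiced" signal; real epochs come from a group-delay epoch extractor
    signal = np.sin(2 * np.pi * 120 * n / fs)
    epochs = np.arange(400, len(signal) - 400, fs // 120)  # hypothetical markers
    for e in epochs[:3]:
        # analyze a two-pitch-period frame anchored on each epoch
        lsf = lp_to_lsf(lp_coefficients(signal[e - 120:e + 120], order=10))
```

Because the autocorrelation method yields a minimum-phase A(z), the roots of P(z) and Q(z) lie on the unit circle, so an order-p analysis yields exactly p sorted LSFs per epoch-anchored frame.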
The performance of the voice conversion system is evaluated using listening tests. The prediction accuracy of the mapping functions (neural network models) used at the different levels of the proposed system is further evaluated using objective measures such as deviation (D_i), root mean square error (μ_RMSE), and correlation coefficient (γ_X,Y). The proposed approach (mapping and modification of parameters using the pitch synchronous approach) is shown to perform better than the author's earlier method, which mapped the vocal tract parameters using block processing.
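The objective measures named above can be computed as in the sketch below. The formulations are the standard ones (percentage deviation, RMSE, Pearson correlation) and are assumptions here; the paper's exact definitions of D_i, μ_RMSE, and γ_X,Y may differ in detail.

```python
import numpy as np

def objective_measures(pred, target):
    """Error measures between predicted and target parameter contours
    (standard formulations assumed, not taken from the paper)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    deviation = 100.0 * np.abs(pred - target) / np.abs(target)  # D_i, percent
    rmse = np.sqrt(np.mean((pred - target) ** 2))               # mu_RMSE
    corr = np.corrcoef(pred, target)[0, 1]                      # gamma_{X,Y}
    return deviation, rmse, corr
```

Such measures would be applied per parameter stream, e.g. comparing predicted and target LSF trajectories or pitch contours over a held-out test set.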