Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis

Authors:
V. Ramu Reddy;K. Sreenivasa Rao
Affiliations:
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India;School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India
Venue:
Computer Speech and Language
Year:
2013

Abstract

This paper proposes a two-stage feedforward neural network (FFNN) based approach for modeling fundamental frequency (F0) values of a sequence of syllables. In this study, (i) linguistic constraints represented by positional, contextual and phonological features, (ii) production constraints represented by articulatory features and (iii) linguistic relevance tilt parameters are proposed for predicting intonation patterns. In the first stage, tilt parameters are predicted using linguistic and production constraints. In the second stage, F0 values of the syllables are predicted using the tilt parameters predicted from the first stage, and basic linguistic and production constraints. The prediction performance of the neural network models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γ_{X,Y}). The prediction accuracy of the proposed two-stage FFNN model is compared with other statistical models such as Classification and Regression Tree (CART) and Linear Regression (LR) models. The prediction accuracy of the intonation models is also analyzed by conducting listening tests to evaluate the quality of synthesized speech obtained after incorporation of intonation models into the baseline system. From the evaluation, it is observed that prediction accuracy is better for two-stage FFNN models, compared to the other models.
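
The three objective measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give their formulas, so the definitions below are assumptions: the average prediction error is taken as the mean absolute difference between actual and predicted F0, the standard deviation is taken over those absolute errors, and the correlation coefficient is the Pearson coefficient between actual and predicted values. The function name `objective_measures` and the sample values are hypothetical and not taken from the paper.

```python
import math


def objective_measures(actual, predicted):
    """Return (mu, sigma, gamma) for paired actual/predicted F0 values in Hz.

    Assumed definitions (not confirmed by the abstract):
      mu    = mean of the absolute errors |x_i - y_i|
      sigma = population standard deviation of those absolute errors
      gamma = Pearson correlation between actual and predicted values
    """
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(actual)

    # Absolute prediction error per syllable.
    errors = [abs(x - y) for x, y in zip(actual, predicted)]
    mu = sum(errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)

    # Pearson correlation: covariance divided by the product of the
    # standard deviations of the two sequences.
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in actual) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in predicted) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    gamma = cov / (sd_x * sd_y)

    return mu, sigma, gamma


# Hypothetical per-syllable F0 values in Hz, for illustration only.
actual_f0 = [100.0, 120.0, 140.0, 160.0]
predicted_f0 = [110.0, 130.0, 150.0, 170.0]
mu, sigma, gamma = objective_measures(actual_f0, predicted_f0)
print(f"mu={mu:.2f} Hz, sigma={sigma:.2f} Hz, gamma={gamma:.3f}")
```

Under these assumed definitions, a lower average error and standard deviation together with a higher correlation would indicate closer agreement between predicted and actual F0 values, which is how such measures are typically used to compare models like FFNN, CART and LR.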