A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis

Authors:
Ọdẹ́túnjí A. Ọdẹ́jọbí; Shun Ha Sylvia Wong; Anthony J. Beaumont
Affiliations:
Room 109, Computer Buildings, Computer Science and Engineering Department, Ọbáfẹ́mi Awólọ́wọ̀ University, Ilé-Ifẹ̀, Nigeria; Computer Science, Aston University, Aston Triangle, Birmingham B4 7ET, UK; Computer Science, Aston University, Aston Triangle, Birmingham B4 7ET, UK
Venue:
Computer Speech and Language
Year:
2008

Abstract

This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yorùbá (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293-324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy-logic-based rules were developed using speech data from native speakers of Yorùbá. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model. To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.
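
The abstract describes the R-Tree as a tree representation of a waveform. As a rough illustration of that idea only, the sketch below builds a tree over a single sampled F0 contour by recursively splitting it at its lowest interior valley, so that leaves correspond to individual peaks. It is not the authors' implementation: the names `Node` and `build_relational_tree`, the one-dimensional input, and the strict local-minimum test (which ignores flat plateaus) are all assumptions made for this example, and the paper's model covers several prosody dimensions rather than F0 alone.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Node:
    """One span of the contour (hypothetical structure for illustration)."""
    start: int                      # index of the first sample in the span
    end: int                        # index of the last sample in the span (inclusive)
    peak: float                     # highest value inside the span
    valley: Optional[float] = None  # value at the splitting valley; None for a leaf
    children: List["Node"] = field(default_factory=list)


def build_relational_tree(samples: Sequence[float],
                          start: int = 0,
                          end: Optional[int] = None) -> Node:
    """Recursively split samples[start..end] at its lowest interior valley."""
    if end is None:
        end = len(samples) - 1
    node = Node(start, end, max(samples[start:end + 1]))

    # Assumption: a valley is a strict local minimum strictly inside the span.
    valleys = [i for i in range(start + 1, end)
               if samples[i] < samples[i - 1] and samples[i] < samples[i + 1]]
    if not valleys:
        return node  # no valley left: the span holds a single peak

    split = min(valleys, key=lambda i: samples[i])
    node.valley = samples[split]
    # The valley sample is shared by both halves so each half stays a closed span.
    node.children = [build_relational_tree(samples, start, split),
                     build_relational_tree(samples, split, end)]
    return node


def leaf_peaks(node: Node) -> List[float]:
    """Collect the peak values of the leaves from left to right."""
    if not node.children:
        return [node.peak]
    return [p for child in node.children for p in leaf_peaks(child)]
```

Calling `build_relational_tree` on a list of F0 values (in Hz, say) returns the root node, and `leaf_peaks` then lists the individual peaks in temporal order. Because every split point lies strictly inside its span, each recursive call works on a shorter span and the construction terminates.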