Developments in Corpus-Based Speech Synthesis: Approaching Natural Conversational Speech

Authors:
Nick Campbell
Affiliations:
The author is with the Department of Emergent Communication of the ATR Network Informatics Laboratories, Kyoto-fu, 619-0288 Japan. E-mail: nick@atr.jp
Venue:
IEICE - Transactions on Information and Systems
Year:
2005

Abstract

This paper describes the special demands of conversational speech in the context of corpus-based speech synthesis. The author proposed the CHATR system of prosody-based unit selection for concatenative waveform synthesis seven years ago, and now extends this work to incorporate the results of an analysis of five years of recordings of spontaneous conversational speech in a wide range of actual daily-life situations. The paper proposes that the expression of affect (often translated as 'kansei' in Japanese) is the main factor differentiating laboratory speech from real-world conversational speech, and presents a framework for the specification of affect through differences in speaking style and voice quality. Having an enormous corpus of speech samples available for concatenation allows the selection of complete phrase-sized utterance segments, and shifts the focus of unit selection from segmental or phonetic continuity to prosodic and discoursal appropriateness. Samples of the resulting large-corpus-based synthesis can be heard at http://feast.his.atr.jp/AESOP.