We propose a new approach to synthesizing emotional speech with a corpus-based concatenative speech synthesis system (ATR CHATR) using corpora of emotional speech. In this study, neither emotion-dependent prosody prediction nor signal processing is performed for emotional speech. Instead, a large speech corpus is created per emotion, and speech with the appropriate emotion is synthesized by simply switching between the emotional corpora. This is made possible by the normalization procedure incorporated in CHATR, which transforms its standard predicted prosody range according to the source database in use. We evaluate our approach by creating three emotional speech corpora (anger, joy, and sadness) from recordings of a male and a female speaker of Japanese. The acoustic characteristics of each corpus are distinct and the emotions identifiable. The acoustic characteristics of each emotional utterance synthesized by our method show clear correlations to those of the corresponding corpus. Perceptual experiments using synthesized speech confirmed that our method can synthesize recognizably emotional speech. We further evaluated the method's intelligibility and the overall impression it gives to listeners. The results show that the proposed method synthesizes speech with high intelligibility and leaves a favorable impression. With these encouraging results, we have developed a workable text-to-speech system with emotion to support the immediate needs of nonspeaking individuals. This paper describes the proposed method, the design and acoustic characteristics of the corpora, and the results of the perceptual evaluations.
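The core idea above — selecting a per-emotion corpus and letting a normalization step remap the standard predicted prosody range to that corpus — can be illustrated with a minimal sketch. This is not CHATR's actual implementation; the corpus statistics and the z-score renormalization of the F0 contour are illustrative assumptions standing in for the paper's normalization procedure.

```python
# Hypothetical per-emotion corpus statistics (mean F0 in Hz and its
# standard deviation); real values would be measured from each corpus.
CORPUS_STATS = {
    "neutral": {"f0_mean": 120.0, "f0_std": 20.0},
    "anger":   {"f0_mean": 160.0, "f0_std": 35.0},
    "joy":     {"f0_mean": 150.0, "f0_std": 30.0},
    "sadness": {"f0_mean": 105.0, "f0_std": 12.0},
}

def normalize_prosody(f0_contour, emotion, source="neutral"):
    """Remap a predicted F0 contour from the standard (source) range
    into the range of the selected emotional corpus via z-score
    renormalization -- a simplified stand-in for the normalization
    procedure described in the abstract."""
    src = CORPUS_STATS[source]
    tgt = CORPUS_STATS[emotion]
    return [
        tgt["f0_mean"]
        + (f0 - src["f0_mean"]) / src["f0_std"] * tgt["f0_std"]
        for f0 in f0_contour
    ]
```

Switching emotion then amounts to choosing a different corpus key: the same predicted contour is stretched or compressed into that corpus's range, so no emotion-specific prosody model is needed.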