Modeling the expressivity of input text semantics for Chinese text-to-speech synthesis in a spoken dialog system

Authors:
Zhiyong Wu; Helen M. Meng; Hongwu Yang; Lianhong Cai
Affiliations:
Dept. of Systems Engineering and Engineering Management, The Chinese Univ. of Hong Kong, China and Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School ...
Dept. of Systems Engineering and Engineering Management, The Chinese Univ. of Hong Kong, China and Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School ...
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School at Shenzhen, Tsinghua Univ., Shenzhen, China and Dept. of Computer Science and Technology, Tsinghu ...
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Abstract

This work develops expressive text-to-speech synthesis techniques for a Chinese spoken dialog system, where expressivity is driven by the message content. We adapt the three-dimensional pleasure-displeasure, arousal-nonarousal, and dominance-submissiveness (PAD) model to describe expressivity in input text semantics. The study uses response messages generated by a spoken dialog system in the tourist information domain. The P (pleasure) and A (arousal) dimensions describe expressivity at the prosodic word level, based on lexical semantics; the D (dominance) dimension describes expressivity at the utterance level, based on dialog acts. We analyze contrastive (neutral versus expressive) speech recordings to develop a nonlinear perturbation model that uses the PAD values of a response message to transform neutral speech into expressive speech. Two levels of perturbation are implemented: local perturbation at the prosodic word level and global perturbation at the utterance level. Perceptual experiments with 14 subjects indicate that the proposed approach can significantly enhance expressivity in response generation for a spoken dialog system.
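
The two-level structure described in the abstract can be illustrated with a small sketch. The abstract does not give the functional form of the perturbation model, so everything concrete here is an assumption made for illustration: the `tanh` mapping, the gain values, the choice of mean F0 and duration as the perturbed features, and the names `ProsodicWord`, `local_scale`, `global_scale`, and `perturb_utterance`. It is not the authors' model; it only shows how word-level P and A values and an utterance-level D value could jointly modify neutral prosody.

```python
import math
from dataclasses import dataclass, replace
from typing import List


@dataclass
class ProsodicWord:
    """One prosodic word of a neutral utterance (hypothetical representation)."""
    text: str
    f0_mean_hz: float   # neutral mean pitch
    duration_s: float   # neutral duration
    pleasure: float     # P value, assumed to lie in [-1, 1]
    arousal: float      # A value, assumed to lie in [-1, 1]


def local_scale(pleasure: float, arousal: float, gain: float = 0.2) -> float:
    """Word-level factor: an assumed nonlinear (tanh) function of P and A."""
    return 1.0 + gain * math.tanh(pleasure + arousal)


def global_scale(dominance: float, gain: float = 0.1) -> float:
    """Utterance-level factor: an assumed nonlinear (tanh) function of D."""
    return 1.0 + gain * math.tanh(dominance)


def perturb_utterance(words: List[ProsodicWord], dominance: float) -> List[ProsodicWord]:
    """Apply local (per-word) and global (per-utterance) perturbation."""
    g = global_scale(dominance)
    result = []
    for word in words:
        s = local_scale(word.pleasure, word.arousal)
        result.append(
            replace(
                word,
                f0_mean_hz=word.f0_mean_hz * s * g,  # local and global pitch change
                duration_s=word.duration_s / s,      # more expressive words are shortened
            )
        )
    return result


if __name__ == "__main__":
    # Placeholder words and PAD values, not taken from the paper's data.
    utterance = [
        ProsodicWord("word_a", 200.0, 0.40, pleasure=0.0, arousal=0.0),
        ProsodicWord("word_b", 210.0, 0.35, pleasure=0.8, arousal=0.6),
    ]
    for w in perturb_utterance(utterance, dominance=0.5):
        print(w.text, round(w.f0_mean_hz, 1), round(w.duration_s, 3))
```

With P, A, and D all equal to zero, both factors are 1.0 and the neutral values pass through unchanged, which matches the idea of perturbing neutral speech only where the message content calls for it.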