This paper describes an emotion conversion system that combines independent parameter transformation techniques to endow a neutral utterance with a desired target emotion. A set of prosody conversion methods has been developed that utilises a small amount of expressive training data (~15 min) and has been evaluated for three target emotions: anger, surprise and sadness. The system performs F0 conversion at the syllable level, while duration conversion takes place at the phone level using a set of linguistic regression trees. Two alternative methods are presented for predicting F0 contours for unseen utterances. Firstly, an HMM-based approach uses syllables as linguistic building blocks to model and generate F0 contours. Secondly, an F0 segment selection approach expresses F0 conversion as a search problem, in which syllable-based F0 contour segments from a target speech corpus are spliced together under contextual constraints. To complement the prosody modules, a GMM-based spectral conversion function is used to transform the voice quality. Each independent module and the combined emotion conversion framework were evaluated through a perceptual study. Preference tests demonstrated that each module contributes a measurable improvement in the perception of the target emotion. Furthermore, an emotion classification test showed that converted utterances produced with either F0 generation technique conveyed the desired emotion above chance level. However, F0 segment selection outperformed the HMM-based F0 generation method in terms of both emotion recognition rates and intonation quality scores, particularly for anger and surprise. Using segment selection, the emotion recognition rates for the converted neutral utterances were comparable to those for the same utterances spoken directly in the target emotion.
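To make the segment selection idea concrete: casting F0 conversion as a search means choosing one candidate contour segment per syllable so that the sum of a target cost (how well a candidate fits the syllable's context) and a concatenation cost (how smoothly adjacent segments join) is minimised, which is solvable by dynamic programming. The sketch below is illustrative only; the function names and the simple cost functions in the usage example are assumptions, not the paper's actual costs.

```python
import numpy as np

def select_f0_segments(candidates, target_cost, concat_cost):
    """Viterbi-style search over per-syllable F0 segment candidates.

    candidates[t] is a list of candidate F0 contours (np.ndarray) for
    syllable t. target_cost(t, seg) scores a candidate against the
    syllable's context; concat_cost(prev, cur) penalises discontinuities
    at the join. Returns the lowest-cost list of candidate indices.
    """
    T = len(candidates)
    # Cumulative best cost for each candidate of the first syllable.
    cost = [[target_cost(0, c) for c in candidates[0]]]
    back = []  # back[t-1][j]: best predecessor index for candidate j at t
    for t in range(1, T):
        row, brow = [], []
        for c in candidates[t]:
            best_j, best = min(
                ((j, cost[t - 1][j] + concat_cost(candidates[t - 1][j], c))
                 for j in range(len(candidates[t - 1]))),
                key=lambda p: p[1])
            row.append(best + target_cost(t, c))
            brow.append(best_j)
        cost.append(row)
        back.append(brow)
    # Backtrace from the cheapest final candidate.
    j = int(np.argmin(cost[-1]))
    path = [j]
    for t in range(T - 1, 1 - 1, -1):
        if t == 0:
            break
        j = back[t - 1][j]
        path.append(j)
    return path[::-1]
```

In a real system the target cost would compare linguistic context features (stress, position, phrase type) rather than raw F0 values, but the search structure is the same.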
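The GMM-based spectral conversion step can likewise be sketched. A joint-density GMM trained on paired source/target spectral frames yields, for each mixture component, a linear regression from source to target space; conversion is the posterior-weighted sum of the per-component regressions. The code below is a minimal sketch of this standard mapping, assuming full covariances and toy parameters; it is not the paper's trained model.

```python
import numpy as np

def _gauss(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = x.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source spectral frame x toward the target voice.

    For each component i, the regression is
        mu_y[i] + cov_yx[i] @ inv(cov_xx[i]) @ (x - mu_x[i]),
    and the output is the sum weighted by the posteriors p(i | x).
    """
    resp = np.array([w * _gauss(x, m, c)
                     for w, m, c in zip(weights, mu_x, cov_xx)])
    resp = resp / resp.sum()  # posterior p(i | x) for each component
    y = np.zeros_like(mu_y[0])
    for r, mx, my, cxx, cyx in zip(resp, mu_x, mu_y, cov_xx, cov_yx):
        y = y + r * (my + cyx @ np.linalg.inv(cxx) @ (x - mx))
    return y
```

With a single component whose cross-covariance equals the source covariance, the mapping reduces to a pure mean shift, which makes the behaviour easy to verify by hand.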