We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests was used to evaluate speech quality, emotion identification rates, and emotional strength for the six emotions that we recorded: happiness, sadness, anger, surprise, fear, and disgust. For the HMM-based method, we evaluated the spectral and source components separately and identified which components contribute to which emotions. Our analysis shows that, although the HMM-based method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions with context-dependent prosodic patterns. While synthetic speech produced with the unit selection method received higher emotional strength scores than that of the HMM-based method, only the HMM-based method is able to manipulate emotional strength. For emotions characterized by both spectral and prosodic components, synthetic speech produced with unit selection was identified more accurately by listeners; for emotions characterized mainly by prosodic components, HMM-based synthetic speech was identified more accurately. This finding differs from previous results on listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improved prosodic modeling, and that HMM-based methods require improved spectral modeling, for emotional speech. Certain emotions cannot be reproduced well by either method.
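As a loose illustration of why only the statistical method can scale emotional strength (this is a sketch of the general style-interpolation idea, not the paper's actual implementation): an HMM-based system can interpolate model parameters, such as per-state mean vectors, between a neutral voice model and an emotional one, with an interpolation weight controlling the strength. All numbers below are hypothetical.

```python
import numpy as np

def interpolate_style(neutral_means, emotional_means, alpha):
    """Blend neutral and emotional model mean vectors.

    alpha = 0.0 gives the neutral model, 1.0 the full emotional model,
    and values above 1.0 exaggerate the emotion. Unit selection has no
    analogous continuous control: it can only pick recorded units.
    """
    neutral = np.asarray(neutral_means, dtype=float)
    emotional = np.asarray(emotional_means, dtype=float)
    return (1.0 - alpha) * neutral + alpha * emotional

# Toy example: hypothetical per-state F0 means (Hz) for one phone model.
neutral = [120.0, 118.0, 115.0]
angry = [160.0, 170.0, 150.0]
half_strength = interpolate_style(neutral, angry, 0.5)
# half_strength is [140.0, 144.0, 132.5]: halfway between the two styles.
```

In a real HSMM-based system the same weighting would be applied jointly to spectral, excitation, and duration parameters rather than to F0 alone.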