In this work we develop and apply a class of hierarchical directed graphical models to the task of recognizing affective categories from prosody in both acted and natural speech. A strength of this approach is that it integrates and summarizes information from both local (e.g., syllable-level) and global (e.g., utterance-level) prosodic phenomena. In this framework speech is modeled as a dynamically evolving hierarchy whose levels are determined by prosodic constituency and whose parameters evolve according to dynamical systems. The acoustic parameters are chosen to capture four main components of speech thought to carry paralinguistic and affect-specific information: intonation, loudness, rhythm, and voice quality. The approach is first evaluated on a database of acted emotions and compared against human perceptual recognition of five affective categories, achieving recognition rates within roughly 10% of human accuracy despite relying on prosody alone. The model is then evaluated on two corpora of fully spontaneous, affectively colored, naturally occurring speech between people: Call Home English and BT Call Center. Here the ground-truth labels are derived from the agreement of 29 human coders labeling arousal and valence. The best discrimination performance on the natural spontaneous speech, using only the prosodic features, is a 70% detection rate at a 30% false-alarm rate when detecting high-arousal, negative-valence speech in call centers.
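As a concrete illustration of the local-to-global summarization described above, the following Python sketch pools frame-level intonation (F0) and loudness (RMS energy) measures into syllable-level means and then into utterance-level statistics, mirroring the two levels of prosodic constituency. This is a minimal sketch, not the authors' model: the frame sizes, the crude autocorrelation pitch tracker, the hand-marked syllable boundaries, and all function names are invented for illustration, and the dynamical-systems machinery of the actual graphical model (as well as the rhythm and voice-quality features) is omitted.

    import numpy as np

    SR = 16000   # assumed sampling rate (Hz)
    FRAME = 400  # 25 ms analysis frames
    HOP = 160    # 10 ms hop

    def frames(x):
        # Slice the signal into overlapping analysis frames.
        n = 1 + (len(x) - FRAME) // HOP
        return np.stack([x[i * HOP:i * HOP + FRAME] for i in range(n)])

    def frame_rms(x):
        # Loudness proxy: root-mean-square energy per frame.
        return np.sqrt((frames(x) ** 2).mean(axis=1))

    def frame_f0(x, fmin=75.0, fmax=400.0):
        # Intonation proxy: crude autocorrelation pitch estimate per frame.
        lo, hi = int(SR / fmax), int(SR / fmin)
        f0 = []
        for fr in frames(x):
            fr = fr - fr.mean()
            ac = np.correlate(fr, fr, mode="full")[FRAME - 1:]
            lag = lo + int(np.argmax(ac[lo:hi]))
            f0.append(SR / lag if ac[lag] > 0 else 0.0)
        return np.array(f0)

    def summarize(track, bounds):
        # Local level: mean of a frame-level track within each syllable span;
        # global level: utterance-wide statistics over those local means.
        local = np.array([track[a:b].mean() for a, b in bounds])
        return local, {"mean": local.mean(), "range": np.ptp(local),
                       "std": local.std()}

    # Toy utterance: a 1 s rising-pitch tone with three hand-marked
    # "syllables" given as (start, end) frame indices.
    t = np.arange(SR) / SR
    x = np.sin(2 * np.pi * (120 + 60 * t) * t)
    syllables = [(0, 30), (30, 60), (60, 95)]

    f0_local, f0_global = summarize(frame_f0(x), syllables)
    rms_local, rms_global = summarize(frame_rms(x), syllables)
    print("syllable F0 means (Hz):", np.round(f0_local, 1))
    print("utterance F0 stats:", f0_global)

In the paper's actual model the local and global levels are coupled through the hierarchical graphical structure rather than computed independently; the sketch shows only the feature pooling.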
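The final result is reported as an operating point on a detector's detection/false-alarm trade-off. As a minimal illustration of how such a pair is computed from binary ground-truth labels (the scores and the 0.5 threshold here are invented for illustration):

    import numpy as np

    def operating_point(scores, labels, thresh):
        # Detection rate: fraction of target (label 1) items flagged;
        # false-alarm rate: fraction of non-target (label 0) items flagged.
        pred = scores >= thresh
        return pred[labels == 1].mean(), pred[labels == 0].mean()

    scores = np.array([0.9, 0.8, 0.6, 0.7, 0.2, 0.3])  # detector outputs
    labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = high-arousal negative valence
    print(operating_point(scores, labels, 0.5))  # -> (1.0, 0.3333...)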