Primitives-based evaluation and estimation of emotions in speech

  • Authors:
  • Michael Grimm, Kristian Kroschel, Emily Mower, Shrikanth Narayanan

  • Affiliations:
  • Universität Karlsruhe (TH), Institut für Nachrichtentechnik (INT), Kaiserstraße 12, 76128 Karlsruhe, Germany (M. Grimm, K. Kroschel)
  • University of Southern California (USC), Speech Analysis and Interpretation Laboratory (SAIL), 3740 McClintock Avenue, Los Angeles, CA 90089, USA (E. Mower, S. Narayanan)

  • Venue:
  • Speech Communication
  • Year:
  • 2007


Abstract

Emotion primitive descriptions are an important alternative to classical emotion categories for describing a human's affective expressions. We build a multi-dimensional emotion space composed of the emotion primitives of valence, activation, and dominance. In this study, an image-based, text-free evaluation system is presented that provides intuitive assessment of these emotion primitives and yields high inter-evaluator agreement. An automatic system for estimating the emotion primitives is introduced. We use a fuzzy logic estimator and a rule base derived from acoustic features in speech such as pitch, energy, speaking rate, and spectral characteristics. The approach is tested on two databases. The first database consists of 680 sentences from 3 speakers containing acted emotions in the categories happy, angry, neutral, and sad. The second database contains more than 1000 utterances of 47 speakers with authentic emotion expressions recorded from a television talk show. The estimation results are compared to the human evaluation as a reference and are moderately to highly correlated (r ≥ 0.42). Finally, continuous-valued estimates of the emotion primitives are mapped into the given emotion categories using a k-nearest neighbor classifier. An overall recognition rate of up to 83.5% is accomplished. The errors of the direct emotion estimation are compared to the confusion matrices of the classification from primitives. As a conclusion to this continuous-valued emotion primitives framework, speaker-dependent modeling of emotion expression is proposed, since the emotion primitives are particularly suited for capturing dynamics and intrinsic variations in emotion expression.
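The final mapping step described in the abstract — classifying a continuous (valence, activation, dominance) estimate into one of the four categories with a k-nearest neighbor rule — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reference points and their primitive-space coordinates below are hypothetical placeholders, not values from the databases used in the study.

```python
import math
from collections import Counter

# Hypothetical labeled reference points in (valence, activation, dominance)
# space, each primitive scaled to [-1, 1]; the paper's categories are used,
# but these coordinates are illustrative assumptions only.
TRAIN = [
    ((0.8, 0.6, 0.4), "happy"),
    ((0.6, 0.4, 0.3), "happy"),
    ((-0.7, 0.8, 0.7), "angry"),
    ((-0.8, 0.7, 0.6), "angry"),
    ((0.0, 0.0, 0.0), "neutral"),
    ((0.1, -0.1, 0.0), "neutral"),
    ((-0.6, -0.7, -0.5), "sad"),
    ((-0.5, -0.6, -0.6), "sad"),
]

def knn_emotion(point, k=3):
    """Map a continuous primitive estimate to an emotion category by
    majority vote among the k nearest labeled reference points."""
    nearest = sorted(TRAIN, key=lambda t: math.dist(point, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A point with low valence and high activation/dominance falls among
# the "angry" reference points.
print(knn_emotion((-0.65, 0.75, 0.65)))
```

In practice the reference set would be the human-evaluated primitive ratings of the training utterances, and k would be tuned on held-out data; Euclidean distance in the three-dimensional primitive space is used here as a plausible default.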