Classification of emotional speech using 3DEC hierarchical classifier

  • Authors:
  • A. Hassan; R. I. Damper

  • Affiliations:
  • Signals, Images, Systems Research Group, School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK (both authors)

  • Venue:
  • Speech Communication
  • Year:
  • 2012

Abstract

The recognition of emotion from speech acoustics is an important problem in human-machine interaction, with many potential applications. In this paper, we first compare four ways to extend binary support vector machines (SVMs) to multiclass classification for recognising emotions from speech, namely two standard SVM schemes (one-versus-one and one-versus-rest) and two other methods, DAG (directed acyclic graph) and UDT (unbalanced decision tree), that form a hierarchy of classifiers, each making a distinct binary decision about class membership. These are trained and tested using 6552 features per speech sample extracted from three databases of acted emotional speech (DES, Berlin and Serbian) and a database of spontaneous speech (FAU Aibo Emotion Corpus) using the OpenEAR toolkit. Analysis of the errors made by these classifiers leads us to apply non-metric multi-dimensional scaling (NMDS) to produce a compact (two-dimensional) representation of the data suitable for guiding the choice of decision hierarchy. This representation can be interpreted in terms of the well-known valence-arousal model of emotion. We find that this model does not give a particularly good fit to the data: although the arousal dimension can be identified easily, valence is not well represented in the transformed data. We describe a new hierarchical classification technique whose structure is based on NMDS, which we call Data-Driven Dimensional Emotion Classification (3DEC). This new method is compared with the best of the four classifiers studied earlier and a state-of-the-art classification method on all four databases. We find no significant difference between these three approaches with respect to speaker-dependent performance. However, for the much more interesting and important case of speaker-independent emotion classification, 3DEC significantly outperforms the competitors.
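
The sketch below illustrates the kind of pipeline the abstract describes, not the authors' actual implementation: a flat multiclass SVM yields a confusion matrix, NMDS projects the resulting class dissimilarities into two dimensions, and that map suggests the top binary split of a decision hierarchy. The toy data, the confusion-to-dissimilarity conversion, and the use of KMeans to pick the split are all illustrative assumptions.

```python
# Minimal sketch of an NMDS-guided hierarchy in the spirit of 3DEC.
# Synthetic data stands in for the 6552-dimensional OpenEAR feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.manifold import MDS
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Toy stand-in for acoustic features from an emotional-speech corpus.
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1. Fit a flat multiclass SVM (one-versus-one here) to get a confusion matrix.
flat = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)
C = confusion_matrix(y_te, flat.predict(X_te)).astype(float)
C /= C.sum(axis=1, keepdims=True)           # row-normalise to confusion rates

# 2. Symmetrise confusions into dissimilarities: classes that are often
#    confused are "close", so dissimilarity = 1 - mean mutual confusion.
S = (C + C.T) / 2.0
D = 1.0 - S
np.fill_diagonal(D, 0.0)

# 3. Non-metric MDS gives a compact 2-D map with one point per class.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
emb = nmds.fit_transform(D)

# 4. Let the map choose the root of the decision hierarchy: cluster the
#    class points into two groups and train a binary SVM on that split.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
y_bin = np.array([groups[c] for c in y_tr])
root = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_bin)
print("2-D class map:\n", emb, "\ntop-level split:", groups)
```

In a full hierarchy, the same step would recurse within each group until every leaf holds a single emotion class; the point of the 2-D map is that splits separate the classes most distinguishable in the data, rather than following a fixed valence-arousal layout.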