In contrast to the predominant approach of computing turn-wise statistics of acoustic Low-Level-Descriptors followed by static classification, we re-investigate dynamic modeling directly at the frame level for speech-based emotion recognition. This seems beneficial, as important information is known to exist on temporal layers below the turn. Most promisingly, we integrate this frame-level information into a state-of-the-art large-feature-space emotion recognition engine. To investigate frame-level processing, we employ a typical speaker-recognition setup tailored to emotion classification: a GMM for classification, with MFCCs plus their speed and acceleration (delta and delta-delta) coefficients as features. We also consider the use of multiple states, i.e., an HMM. To fuse this information with turn-based modeling, the output scores are appended to a super-vector combined with static acoustic features, where a variety of Low-Level-Descriptors and functionals are considered to cover prosodic, voice quality, and articulatory aspects. Starting from 1.4k features, we select optimal configurations both including and excluding the GMM information. The final decision is made by an SVM. Extensive test runs are carried out on two popular public databases, EMO-DB and SUSAS, to investigate acted and spontaneous data. As we face the current challenge of speaker-independent analysis, we also discuss the benefits arising from speaker normalization. The results obtained clearly emphasize the superior power of integrating diverse time levels.
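To illustrate the fusion described above, here is a minimal sketch in Python, assuming librosa for MFCC extraction and scikit-learn's GaussianMixture and SVC. All function names, the reduced functional set, and the normalization step are hypothetical stand-ins for illustration, not the authors' actual implementation (which selects from ~1.4k features and applies speaker normalization).

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def frame_features(wav, sr=16000, n_mfcc=13):
    """MFCCs plus delta (speed) and delta-delta (acceleration): (n_frames, 39)."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T

def turn_functionals(frames):
    """Static turn-level statistics (tiny stand-in for the full functional set)."""
    return np.concatenate([frames.mean(0), frames.std(0),
                           frames.min(0), frames.max(0)])

def train_gmms(turns, labels, n_classes, n_components=32):
    """One GMM per emotion class, trained on the pooled frames of that class."""
    gmms = []
    for c in range(n_classes):
        pooled = np.vstack([frame_features(w)
                            for w, y in zip(turns, labels) if y == c])
        gmms.append(GaussianMixture(n_components=n_components,
                                    covariance_type="diag").fit(pooled))
    return gmms

def super_vector(wav, gmms):
    """Append per-class GMM log-likelihood scores to the static feature vector."""
    frames = frame_features(wav)
    scores = np.array([g.score(frames) for g in gmms])  # mean frame log-likelihood
    return np.concatenate([turn_functionals(frames), scores])

def train(turns, labels, n_classes):
    gmms = train_gmms(turns, labels, n_classes)
    X = np.vstack([super_vector(w, gmms) for w in turns])
    # Global standardization; a stand-in for per-speaker normalization.
    scaler = StandardScaler().fit(X)
    svm = SVC(kernel="linear").fit(scaler.transform(X), labels)
    return gmms, scaler, svm
```

At test time, an unseen turn is mapped through `super_vector` with the trained GMMs, scaled, and classified by the SVM, so the frame-level dynamic scores and the turn-level static functionals enter the final decision jointly.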