Emotion recognition from speech: Putting ASR in the loop

Authors:
Björn Schuller;Anton Batliner;Stefan Steidl;Dino Seppi
Affiliations:
Institute for Human-Machine Communication, Technische Universität München, Germany;Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany;Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany;Polderland Language&Speech Technology, Nijmegen, The Netherlands
Venue:
ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Year:
2009

Citing 0
Cited 6

Detecting emotional state of a child in a conversational computer game
Computer Speech and Language

Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach
Advances in Human-Computer Interaction - Special issue on emotion-aware natural interaction

Determination of nonprototypical valence and arousal in popular music: features and performances
EURASIP Journal on Audio, Speech, and Music Processing - Special issue on scalable audio-content analysis

On the impact of children's emotional speech on acoustic and language models
EURASIP Journal on Audio, Speech, and Music Processing - Special issue on atypical speech

Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario
ACM Transactions on Speech and Language Processing (TSLP)

Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge
Speech Communication


Abstract

This paper investigates the automatic recognition of emotion from spoken words, comparing vector space modeling with string kernels, which have not yet been investigated in this respect. Apart from the spoken content itself, we integrate Part-of-Speech and higher semantic tagging in our analyses. As opposed to most works in the field, we evaluate the performance with an ASR engine in the loop. Extensive experiments are run on the FAU Aibo Emotion Corpus of 4k spontaneous emotional child-robot interactions and show surprisingly low performance degradation with real ASR compared to transcription-based emotion recognition. Overall, bag-of-words modeling outperforms all other modeling forms based on the spoken content.
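
To make the two families of representations named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization for the bag-of-words vector and uses a character p-spectrum kernel as one simple example of a string kernel; the abstract does not say which string kernel, tokenizer, or classifier the paper uses. The function names and the example utterances are hypothetical.

```python
from collections import Counter
from math import sqrt


def bag_of_words(utterance: str) -> Counter:
    """Map an utterance to word counts; word order is discarded."""
    return Counter(utterance.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def spectrum_kernel(s: str, t: str, p: int = 3) -> int:
    """p-spectrum kernel: inner product of character p-gram counts.

    Assumption: chosen here only as a simple, well-known string kernel.
    """
    grams_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    grams_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(count * grams_t[gram] for gram, count in grams_s.items())


if __name__ == "__main__":
    # Hypothetical manual transcript vs. a hypothetical ASR hypothesis
    # containing one misrecognized word.
    transcript = "aibo go left"
    asr_output = "aibo go let"
    print(cosine(bag_of_words(transcript), bag_of_words(asr_output)))
    print(spectrum_kernel(transcript, asr_output, p=3))
```

The bag-of-words vector only registers whole-word matches, so a misrecognized word contributes nothing to the similarity, whereas the character-level kernel still credits the substrings the two strings share. Either similarity could then be handed to a kernel-based classifier; that step is left out here because the abstract does not describe it.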