LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

Authors:
Martin Wöllmer;Moritz Kaiser;Florian Eyben;Björn Schuller;Gerhard Rigoll
Venue:
Image and Vision Computing
Year:
2013

Abstract

Automatically recognizing human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This article presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, we propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. LSTM networks are able to incorporate knowledge about how emotions typically evolve over time so that the inferred emotion estimates are produced under consideration of an optimal amount of context. Extensive evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how acoustic, linguistic, and visual features contribute to the recognition of different affective dimensions as annotated in the SEMAINE database. We apply the same acoustic features as used in the challenge baseline system whereas visual features are computed via a novel facial movement feature extractor. Comparing our results with the recognition scores of all Audiovisual Sub-Challenge participants, we find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
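
As a rough illustration of how an LSTM can carry context across a sequence of word-level feature vectors, the following NumPy sketch runs a single-layer LSTM with forget gates over one sequence and emits one unnormalized score per word for each of the four affective dimensions named above. It is not the authors' implementation: the function names, layer sizes, random weights, and the linear output layer are all assumptions made for illustration, and no training is shown.

```python
import numpy as np

# Affective dimensions named in the abstract.
DIMENSIONS = ["arousal", "expectation", "power", "valence"]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(D, H, K, seed=0):
    """Create small random weights (hypothetical sizes; untrained)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (4 * H, D))   # input -> gates
    U = rng.normal(0.0, 0.1, (4 * H, H))   # previous hidden state -> gates
    b = np.zeros(4 * H)                    # gate biases
    W_out = rng.normal(0.0, 0.1, (K, H))   # hidden state -> output scores
    b_out = np.zeros(K)
    return W, U, b, W_out, b_out


def lstm_forward(features, params):
    """Run the LSTM over one sequence.

    features: array of shape (T, D), one row per word
              (assumed to be a fused audio/video feature vector).
    Returns an array of shape (T, K): one score per word and dimension.
    """
    W, U, b, W_out, b_out = params
    H = U.shape[1]
    h = np.zeros(H)  # hidden state
    c = np.zeros(H)  # memory cell state, carries long-range context
    outputs = []
    for x in features:
        z = W @ x + U @ h + b
        i = sigmoid(z[0:H])            # input gate
        f = sigmoid(z[H:2 * H])        # forget gate
        o = sigmoid(z[2 * H:3 * H])    # output gate
        g = np.tanh(z[3 * H:4 * H])    # candidate cell update
        c = f * c + i * g              # keep part of the old context, add new
        h = o * np.tanh(c)
        outputs.append(W_out @ h + b_out)
    return np.array(outputs)


if __name__ == "__main__":
    params = init_params(D=6, H=5, K=len(DIMENSIONS))
    words = np.ones((3, 6))  # three identical dummy word-level vectors
    scores = lstm_forward(words, params)
    for t, row in enumerate(scores):
        print(t, dict(zip(DIMENSIONS, np.round(row, 4))))
```

Because the cell state `c` is updated at every step, the three identical input vectors in the usage example yield different scores at each position, which is the sense in which the estimate for a word depends on the preceding context rather than on that word's features alone.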