Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering

  • Authors:
  • Arman Savran, Houwei Cao, Miraj Shah, Ani Nenkova, Ragini Verma

  • Affiliations:
  • University of Pennsylvania, Philadelphia, PA, USA (all authors)

  • Venue:
  • Proceedings of the 14th ACM International Conference on Multimodal Interaction
  • Year:
  • 2012

Abstract

We present experiments on fusing facial video, audio, and lexical indicators for affect estimation during dyadic conversations. We use temporal statistics of texture descriptors extracted from facial video, a combination of various acoustic features, and lexical features to build regression-based affect estimators for each modality. The single-modality regressors are then combined using particle filtering: their independent outputs are treated as measurements of the affect state in a Bayesian filtering framework, where previous observations predict the current state through learned affect dynamics. Tested on the Audio-Visual Emotion Recognition Challenge dataset, our single-modality estimators achieve substantially higher scores than the official baseline method for every dimension of affect. Our filtering-based multi-modality fusion achieves correlation performance of 0.344 (baseline: 0.136) and 0.280 (baseline: 0.096) on the fully continuous and word-level sub-challenges, respectively.
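To make the fusion idea concrete, below is a minimal sketch of a particle filter that treats per-modality regression outputs as noisy measurements of a latent affect state. The AR(1) dynamics coefficients, the per-modality noise variances, and the function name `particle_filter_fusion` are illustrative assumptions, not the authors' implementation; the paper learns its affect dynamics from data.

```python
# Minimal sketch: fuse three single-modality affect estimates per frame with a
# particle filter. Assumptions: a 1-D affect state, linear-Gaussian (AR(1))
# dynamics, and independent Gaussian measurement noise per modality.
import numpy as np

def particle_filter_fusion(z_video, z_audio, z_lex,
                           a=0.95, q=0.05,        # assumed dynamics x_t = a*x_{t-1} + N(0, q)
                           r=(0.3, 0.3, 0.4),     # assumed per-modality measurement variances
                           n_particles=500, seed=0):
    """Return a filtered affect estimate per frame from three measurement streams."""
    rng = np.random.default_rng(seed)
    T = len(z_video)
    particles = rng.normal(0.0, 1.0, n_particles)   # initial state particles
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = np.empty(T)

    for t in range(T):
        # Prediction: propagate particles through the (assumed) affect dynamics.
        particles = a * particles + rng.normal(0.0, np.sqrt(q), n_particles)

        # Update: weight particles by the likelihood of each modality's output,
        # treated as a conditionally independent Gaussian measurement.
        for z, var in zip((z_video[t], z_audio[t], z_lex[t]), r):
            weights *= np.exp(-0.5 * (z - particles) ** 2 / var)
        weights += 1e-300                           # guard against all-zero weights
        weights /= weights.sum()

        # Point estimate: posterior mean over particles.
        estimates[t] = np.dot(weights, particles)

        # Resample when the effective sample size drops below half the particles.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, n_particles, p=weights)
            particles = particles[idx]
            weights = np.full(n_particles, 1.0 / n_particles)

    return estimates
```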