Multimodal emotion recognition based on the decoupling of emotion and speaker information

  • Authors:
  • Rok Gajšek; Vitomir Štruc; France Mihelič

  • Affiliations:
  • Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia (all authors)

  • Venue:
  • TSD'10: Proceedings of the 13th International Conference on Text, Speech and Dialogue
  • Year:
  • 2010


Abstract

The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker's identity. This is expected, since the variations that emotionally colored speech introduces into the speech signal are similar in nature to those caused by different speakers. We therefore present a transformation, derived via gradient descent, for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech '09 Emotion Challenge feature set serves as the baseline for the audio part. A similar procedure is employed on the video signal, where nuisance attribute projection (NAP) is used to derive the transformation matrix that captures information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE '05 database, where significant improvements in recognition performance are observed compared to the state-of-the-art baseline.
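
The abstract names three generic building blocks: nuisance attribute projection, canonical correlations between the resulting subspaces, and matching-score fusion. The Python sketch below illustrates textbook versions of these steps only; it does not reproduce the paper's exact formulation. The function names, the choice of speaker identity as the nuisance factor, and the min-max score normalization are all assumptions added for illustration.

    import numpy as np

    def nap_projection(X, speaker_ids, k):
        """Nuisance attribute projection (illustrative sketch).
        Removes the k leading within-speaker variability directions.
        X: (n_samples, dim) features; speaker_ids: (n_samples,) labels."""
        # Center each speaker's features; the pooled residuals span
        # the (assumed) nuisance directions.
        residuals = []
        for s in np.unique(speaker_ids):
            Xs = X[speaker_ids == s]
            residuals.append(Xs - Xs.mean(axis=0))
        R = np.vstack(residuals)
        # Leading right singular vectors of the residuals give an
        # orthonormal basis V of the nuisance subspace.
        _, _, Vt = np.linalg.svd(R, full_matrices=False)
        V = Vt[:k].T                                # (dim, k)
        # Project the nuisance subspace out: P = I - V V^T.
        return np.eye(X.shape[1]) - V @ V.T, V

    def subspace_similarity(V1, V2):
        """Canonical correlations between two subspaces with orthonormal
        bases V1, V2 (dim x k): the singular values of V1^T V2 are the
        cosines of the principal angles between the subspaces."""
        return np.linalg.svd(V1.T @ V2, compute_uv=False)

    def fuse_scores(audio_scores, video_scores, alpha=0.5):
        """Sum-rule fusion at the matching-score level; min-max
        normalization and the weight alpha are assumptions."""
        norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
        return alpha * norm(audio_scores) + (1 - alpha) * norm(video_scores)

    # Minimal usage example on synthetic data (hypothetical shapes).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))
    spk = rng.integers(0, 10, size=200)
    P, V = nap_projection(X, spk, k=5)
    X_clean = X @ P                                 # P is symmetric

In this standard formulation the projection discards speaker variability and keeps the remainder; the paper's video sub-system instead reads emotion-related information off the derived transformation matrix, and its specific weighting and fusion schemes may differ from the sum rule shown here.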