Compensating for speaker or lexical variabilities in speech for emotion recognition

Authors:
Soroosh Mariooryad;Carlos Busso
Venue:
Speech Communication
Year:
2014



Abstract

Affect recognition is a crucial requirement for future human-machine interfaces to respond effectively to the nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, the human voice conveys a mixture of information, including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful consideration of how to compensate for the effect of these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists of building phoneme-level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights into the dependency between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively.
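
The abstract names a whitening transformation as the basis of the normalization but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' method: it assumes features are whitened separately within each group (for example, one group per speaker or per phoneme) so that every group ends up with zero mean and identity covariance. The function name `whiten_per_group`, the grouping choice, the ZCA form of the whitening matrix and the `eps` floor are all assumptions introduced for illustration.

```python
import numpy as np


def whiten_per_group(features, groups, eps=1e-8):
    """Whiten feature vectors separately within each group (hypothetical helper).

    features: array of shape (n_samples, n_features)
    groups:   array of shape (n_samples,) with one label per sample,
              e.g. a speaker ID or a phoneme label (an assumed grouping)
    eps:      floor on eigenvalues to avoid division by zero
    """
    features = np.asarray(features, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(features)

    for g in np.unique(groups):
        idx = groups == g
        x = features[idx]

        # Remove the group-specific mean.
        centered = x - x.mean(axis=0)

        # Group-specific covariance (features in columns).
        cov = np.atleast_2d(np.cov(centered, rowvar=False))

        # Eigendecomposition of the symmetric covariance matrix.
        eigvals, eigvecs = np.linalg.eigh(cov)
        inv_sqrt = 1.0 / np.sqrt(np.maximum(eigvals, eps))

        # ZCA whitening matrix: cov^(-1/2), an assumed choice of whitening.
        w = eigvecs @ np.diag(inv_sqrt) @ eigvecs.T

        out[idx] = centered @ w

    return out
```

After this step, group-specific offsets and scalings in the features are removed, which is one plausible way to attenuate speaker or lexical variability before training an emotion classifier. Each group needs more samples than feature dimensions for the covariance estimate to be well conditioned.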