Fundamentals of speech recognition.
Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics.
Toward Machine Emotional Intelligence: Analysis of Affective Physiological State. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Modeling drivers' speech under stress. Speech Communication - Special issue on speech and emotion.
Emotion recognition in human-computer interaction. Neural Networks - Special issue: Emotion and brain.
Ensemble methods for spoken emotion recognition in call-centres. Speech Communication.
Primitives-based evaluation and estimation of emotions in speech. Speech Communication.
Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication.
Automatic Classification of Expressiveness in Speech: A Multi-corpus Study. Speaker Classification II.
Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing. ACII '07 Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction.
What Should a Generic Emotion Markup Language Be Able to Represent? ACII '07 Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction.
A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Data-driven emotion conversion in spoken English. Speech Communication.
Image and Vision Computing.
Comparison of Different Classifiers for Emotion Recognition. PCI '09 Proceedings of the 2009 13th Panhellenic Conference on Informatics.
Computer Speech and Language.
Advances in Human-Computer Interaction - Special issue on emotion-aware natural interaction.
Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies. IEEE Transactions on Affective Computing.
CVPR '03 Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
ICME '11 Proceedings of the 2011 IEEE International Conference on Multimedia and Expo.
The role of automatic emotion recognition from speech is growing continuously, driven by the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis independent of phonetic transcription. Here, we are interested in phoneme-based classification of the level of arousal in acted and spontaneous emotions. First, we show that our previously published classification technique, which achieved strong results in the Interspeech 2009 Emotion Challenge, cannot provide sufficiently good classification under cross-corpora evaluation (a condition close to real-life applications). To assess the robustness of our emotion classification techniques, we use cross-corpora evaluation for a simplified two-class problem, namely high- versus low-arousal emotions, and model emotion classes at the phoneme level. We build our speaker-independent emotion classifier with HMMs, using GMM-based production probabilities and MFCC features. This classifier performs equally well with a complete phoneme set and with a reduced set of indicative vowels (7 of the 39 phonemes in the German SAMPA list). We then compare the emotion classification performance of the technique used in the Emotion Challenge with phoneme-based classification within the same experimental setup. With phoneme-level emotion classes we increase cross-corpora classification performance by about 3.15% absolute (4.69% relative) for models trained on acted emotions (EMO-DB) and evaluated on spontaneous emotions (VAM); in the reverse condition (trained on VAM, tested on EMO-DB) we obtain a 15.43% absolute (23.20% relative) improvement. We show that phoneme-level emotion classes can improve classification performance even with the comparably low speech recognition performance obtained with scant a priori knowledge about the language, implemented as a zero-gram for word-level modeling and a bi-gram for phoneme-level modeling. Finally, we compare our results with state-of-the-art cross-corpora evaluations on the VAM database. For training our models we use an almost 15 times smaller training set, consisting of 456 utterances (210 low- and 246 high-arousal emotions) instead of 6,820 utterances (4,685 high- and 2,135 low-arousal emotions). We are nevertheless able to increase cross-corpora classification performance by about 2.25% absolute (3.22% relative), from UA = 69.7% obtained by Zhang et al. to UA = 71.95%.
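As an illustration of the acoustic modeling described above, the following is a minimal sketch of a two-class (high/low arousal) GMM-HMM classifier over MFCC features. The library choices (librosa, hmmlearn) and all hyperparameters (state count, mixture count, MFCC order) are illustrative assumptions, not the authors' actual setup; the phoneme-level aspect of the paper is only noted in comments rather than implemented.

```python
# Sketch of two-class (high vs. low arousal) emotion classification with
# GMM-HMMs over MFCC features. Library choices (librosa, hmmlearn) and all
# hyperparameters are illustrative assumptions, not the authors' setup.
# The paper's phoneme-level variant would train one such model per phoneme
# (or per indicative vowel) on force-aligned segments and sum the segment
# log-likelihoods per utterance; here a single model per class is trained
# on whole utterances to keep the sketch short.
import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM

N_MFCC = 13  # assumed MFCC order

def mfcc_frames(wav_path, sr=16000):
    """Return an (n_frames, N_MFCC) MFCC matrix for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T

def train_arousal_model(wav_paths, n_states=3, n_mix=4):
    """Fit one GMM-HMM on all training utterances of a single arousal class."""
    feats = [mfcc_frames(p) for p in wav_paths]
    X = np.vstack(feats)
    lengths = [f.shape[0] for f in feats]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

def classify(wav_path, model_low, model_high):
    """Assign the arousal class whose model gives the higher log-likelihood."""
    X = mfcc_frames(wav_path)
    return "high" if model_high.score(X) > model_low.score(X) else "low"
```

In a cross-corpus setup like the one evaluated above, the two class models would be trained on one corpus (e.g. EMO-DB) and `classify` applied to utterances of the other (e.g. VAM).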
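For readers checking the arithmetic, the relative figures quoted above follow the usual convention of absolute gain divided by the baseline. A one-line verification of the final VAM comparison, with both UA values taken from the abstract:

```python
# Sanity check: relative improvement = absolute gain / baseline UA.
baseline_ua = 69.70   # Zhang et al., cross-corpus UA on VAM (as cited above)
our_ua = 71.95
absolute = our_ua - baseline_ua            # 2.25 percentage points
relative = 100.0 * absolute / baseline_ua  # ~3.2 % relative
print(f"absolute gain: {absolute:.2f} points, relative gain: {relative:.2f} %")
```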