Fundamentals of speech recognition.
Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics.
Toward Machine Emotional Intelligence: Analysis of Affective Physiological State. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Modeling drivers' speech under stress. Speech Communication - Special issue on speech and emotion.
Emotion recognition in human-computer interaction. Neural Networks - Special issue: Emotion and brain.
Ensemble methods for spoken emotion recognition in call-centres. Speech Communication.
Primitives-based evaluation and estimation of emotions in speech. Speech Communication.
Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication.
Automatic Classification of Expressiveness in Speech: A Multi-corpus Study. Speaker Classification II.
Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing. ACII '07 Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction.
What Should a Generic Emotion Markup Language Be Able to Represent? ACII '07 Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction.
A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Data-driven emotion conversion in spoken English. Speech Communication.
Image and Vision Computing.
Comparison of Different Classifiers for Emotion Recognition. PCI '09 Proceedings of the 2009 13th Panhellenic Conference on Informatics.
Computer Speech and Language.
Advances in Human-Computer Interaction - Special issue on emotion-aware natural interaction.
Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies. IEEE Transactions on Affective Computing.
CVPR '03 Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
ICME '11 Proceedings of the 2011 IEEE International Conference on Multimedia and Expo.
The role of automatic emotion recognition from speech is growing continuously, driven by the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis independent of phonetic transcription. Here, we are interested in phoneme-based classification of the level of arousal in acted and spontaneous emotions. First, we show that our previously published classification technique, which achieved strong results in the Interspeech 2009 Emotion Challenge, cannot provide sufficiently good classification under cross-corpora evaluation (a condition close to real-life applications). To assess the robustness of our emotion classification techniques, we use cross-corpora evaluation for a simplified two-class problem, namely high- versus low-arousal emotions, and model emotion classes at the phoneme level. We build our speaker-independent emotion classifier with HMMs, using GMM-based production probabilities and MFCC features. This classifier performs equally well with a complete phoneme set and with a reduced set of indicative vowels (7 of the 39 phonemes in the German SAMPA list). We then compare the emotion classification performance of the technique used in the Emotion Challenge with phoneme-based classification within the same experimental setup. With phoneme-level emotion classes we increase cross-corpora classification performance by about 3.15% absolute (4.69% relative) for models trained on acted emotions (EMO-DB) and evaluated on spontaneous emotions (VAM); in the reverse condition (trained on VAM, tested on EMO-DB) we obtain a 15.43% absolute (23.20% relative) improvement. We show that phoneme-level emotion classes can improve classification performance even with the comparably low speech recognition performance obtained with scant a priori knowledge about the language, implemented as a zero-gram for word-level modeling and a bi-gram for phoneme-level modeling. Finally, we compare our results with state-of-the-art cross-corpora evaluations on the VAM database. For training our models we use an almost 15 times smaller training set, consisting of 456 utterances (210 low- and 246 high-arousal emotions) instead of 6,820 utterances (4,685 high- and 2,135 low-arousal emotions). We are nevertheless able to increase cross-corpora classification performance by about 2.25% absolute (3.22% relative), from UA = 69.7% obtained by Zhang et al. to UA = 71.95%.
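As an illustration of the acoustic modeling described above, the following is a minimal sketch of a two-class (high/low arousal) GMM-HMM classifier over MFCC features. The library choices (librosa, hmmlearn) and all hyperparameters (state count, mixture count, MFCC order) are illustrative assumptions, not the authors' actual setup; the phoneme-level aspect of the paper is only noted in comments rather than implemented.

```python
# Sketch of two-class (high vs. low arousal) emotion classification with
# GMM-HMMs over MFCC features. Library choices (librosa, hmmlearn) and all
# hyperparameters are illustrative assumptions, not the authors' setup.
# The paper's phoneme-level variant would train one such model per phoneme
# (or per indicative vowel) on force-aligned segments and sum the segment
# log-likelihoods per utterance; here a single model per class is trained
# on whole utterances to keep the sketch short.
import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM

N_MFCC = 13  # assumed MFCC order

def mfcc_frames(wav_path, sr=16000):
    """Return an (n_frames, N_MFCC) MFCC matrix for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T

def train_arousal_model(wav_paths, n_states=3, n_mix=4):
    """Fit one GMM-HMM on all training utterances of a single arousal class."""
    feats = [mfcc_frames(p) for p in wav_paths]
    X = np.vstack(feats)
    lengths = [f.shape[0] for f in feats]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

def classify(wav_path, model_low, model_high):
    """Assign the arousal class whose model gives the higher log-likelihood."""
    X = mfcc_frames(wav_path)
    return "high" if model_high.score(X) > model_low.score(X) else "low"
```

In a cross-corpus setup like the one evaluated above, the two class models would be trained on one corpus (e.g. EMO-DB) and `classify` applied to utterances of the other (e.g. VAM).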
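For readers checking the arithmetic, the relative figures quoted above follow the usual convention of absolute gain divided by the baseline. A one-line verification of the final VAM comparison, with both UA values taken from the abstract:

```python
# Sanity check: relative improvement = absolute gain / baseline UA.
baseline_ua = 69.70   # Zhang et al., cross-corpus UA on VAM (as cited above)
our_ua = 71.95
absolute = our_ua - baseline_ua            # 2.25 percentage points
relative = 100.0 * absolute / baseline_ua  # ~3.2 % relative
print(f"absolute gain: {absolute:.2f} points, relative gain: {relative:.2f} %")
```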