Analysis of Emotionally Salient Aspects of Fundamental Frequency for Emotion Detection

  • Authors:
  • C. Busso; Sungbok Lee; S. Narayanan

  • Affiliations:
  • Signal & Image Processing Institute, University of Southern California, Los Angeles, CA

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2009

Abstract

During expressive speech, the voice is enriched to convey not only the intended semantic message but also the emotional state of the speaker. The pitch contour is one of the important properties of speech that is affected by this emotional modulation. Although pitch features have been commonly used to recognize emotions, it is not clear what aspects of the pitch contour are the most emotionally salient. This paper presents an analysis of the statistics derived from the pitch contour. First, pitch features derived from emotional speech samples are compared with the ones derived from neutral speech, by using symmetric Kullback-Leibler distance. Then, the emotionally discriminative power of the pitch features is quantified by comparing nested logistic regression models. The results indicate that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape. Also, analyzing the pitch statistics at the utterance level is found to be more accurate and robust than analyzing the pitch statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected to build a binary emotion detection system for distinguishing between emotional versus neutral speech. A new two-step approach is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral model. In the second step, a fitness measure is used to assess whether the input speech is similar to, in the case of neutral speech, or different from, in the case of emotional speech, the reference models. The proposed approach is tested with four acted emotional databases spanning different emotional categories, recording settings, speakers, and languages. The results show that the recognition accuracy of the system is over 77% just with the pitch features (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of both accuracy and robustness.
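The sketch below is not the authors' implementation; it is a minimal illustration of the two ideas summarized in the abstract, under simplifying assumptions: each utterance-level pitch statistic is modeled as a 1-D Gaussian, emotional and neutral distributions are compared with the symmetric Kullback-Leibler distance, and detection is done by thresholding a fitness measure (here, the log-likelihood under the neutral reference model). All feature values, thresholds, and function names are hypothetical.

```python
# Hypothetical sketch (not the paper's code): Gaussian models of a single pitch
# statistic, symmetric KL comparison, and neutral-reference likelihood detection.
import numpy as np

def gaussian_params(x):
    """Mean and variance of a 1-D feature sample (e.g., utterance-level F0 mean)."""
    return np.mean(x), np.var(x) + 1e-8

def symmetric_kl(p, q):
    """Symmetric KL distance between two 1-D Gaussians given as (mean, variance)."""
    mu_p, var_p = p
    mu_q, var_q = q
    kl_pq = 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    kl_qp = 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl_pq + kl_qp

def neutral_log_likelihood(feature, neutral_params):
    """Fitness measure: log-likelihood of an input feature under the neutral model."""
    mu, var = neutral_params
    return -0.5 * (np.log(2 * np.pi * var) + (feature - mu) ** 2 / var)

# Toy data: utterance-level pitch means (Hz); both sets are synthetic assumptions.
rng = np.random.default_rng(0)
neutral_f0_mean = rng.normal(120.0, 15.0, size=200)    # "neutral" training utterances
emotional_f0_mean = rng.normal(170.0, 35.0, size=200)  # "emotional" test utterances

neutral_model = gaussian_params(neutral_f0_mean)
emotional_model = gaussian_params(emotional_f0_mean)

# Step 1 (analysis): how far does the emotional feature distribution deviate from neutral?
print("symmetric KL distance:", symmetric_kl(neutral_model, emotional_model))

# Step 2 (detection): threshold the fitness measure; the cutoff is a made-up value
# that would in practice be tuned on development data.
threshold = -6.0
for f0 in (118.0, 185.0):
    label = "neutral" if neutral_log_likelihood(f0, neutral_model) > threshold else "emotional"
    print(f"utterance F0 mean = {f0:.0f} Hz -> {label}")
```

In this toy setup only one statistic (the utterance-level F0 mean) is used; the paper's feature analysis considers a larger set of pitch-contour statistics and selects the most discriminative ones before building the detector.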