In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained with an auditory filterbank followed by a modulation filterbank, the representation captures both acoustic frequency and temporal modulation frequency components, thereby conveying information that is important for human speech perception but missing from conventional short-term spectral features. In an experiment on classification of discrete emotion categories, the MSFs show promising performance compared with features based on mel-frequency cepstral coefficients and perceptual linear prediction coefficients, two commonly used short-term spectral representations. The MSFs further yield a substantial improvement in recognition performance when used to augment prosodic features, which have been used extensively for emotion recognition. With both feature types combined, an overall recognition rate of 91.6% is obtained for classifying seven emotion categories. Moreover, in an experiment on recognition of continuous emotions, the proposed features combined with prosodic features attain estimation performance comparable to human evaluation.
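The pipeline described above (an acoustic filterbank producing per-band temporal envelopes, followed by a modulation-frequency analysis of each envelope) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes a uniform grouping of STFT bins for the auditory filterbank and a pooled FFT magnitude for the modulation filterbank; the function name and parameters (`n_bands`, `n_mod`) are hypothetical.

```python
import numpy as np
from scipy.signal import stft

def modulation_spectral_features(x, fs, n_bands=8, n_mod=4,
                                 frame_len=0.025, hop=0.010):
    """Sketch of modulation spectral feature extraction.

    Returns n_bands * n_mod features: modulation-band energies
    per acoustic frequency band (a simplification of the paper's
    auditory/modulation filterbank analysis).
    """
    # Short-time spectrogram: acoustic-frequency analysis over time.
    nperseg = int(frame_len * fs)
    noverlap = nperseg - int(hop * fs)
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    mag = np.abs(Z)  # shape: (freq_bins, frames)

    # Group frequency bins uniformly into n_bands acoustic bands
    # (assumption: the paper uses an auditory-inspired filterbank instead).
    bins_per_band = mag.shape[0] // n_bands
    feats = []
    for b in range(n_bands):
        # Temporal envelope of this acoustic band across frames.
        env = mag[b * bins_per_band:(b + 1) * bins_per_band].mean(axis=0)
        # Modulation spectrum: magnitude FFT of the (DC-removed) envelope.
        mod = np.abs(np.fft.rfft(env - env.mean()))
        # Pool modulation energies into n_mod modulation bands.
        edges = np.linspace(0, len(mod), n_mod + 1, dtype=int)
        feats.extend(mod[edges[i]:edges[i + 1]].sum() for i in range(n_mod))
    return np.array(feats)
```

In this sketch each utterance yields a fixed-length vector (here 8 acoustic bands x 4 modulation bands = 32 values) that jointly encodes acoustic frequency and temporal modulation frequency, which is the property the abstract attributes to the MSFs.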