Human emotion recognition from videos using spatio-temporal and audio features

Authors:
Munaf Rashid;S. A. Abu-Bakar;Musa Mokji
Affiliations:
Computer Vision, Video and Image Processing Lab (CVVIP), Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia UTM 81310 and College of Engineering (COE), Karachi ...;Computer Vision, Video and Image Processing Lab (CVVIP), Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia UTM 81310;Computer Vision, Video and Image Processing Lab (CVVIP), Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia UTM 81310
Venue:
The Visual Computer: International Journal of Computer Graphics
Year:
2013

Abstract

In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion data set with different subjects of both genders. The mel-frequency cepstral coefficient (MFCC) and prosodic features are first identified and then extracted from emotional speech. For facial expressions, spatio-temporal features are extracted from the visual streams. Principal component analysis (PCA) is applied to reduce the dimensionality of the visual features while retaining 97 % of the variance. A codebook is constructed for both the audio and the visual features in Euclidean space. Histograms of codeword occurrences are then used as input to a state-of-the-art SVM classifier to obtain the judgment of each classifier. The judgments from the classifiers are then combined using the Bayes sum rule (BSR) as a final decision step. The proposed system is tested on a public data set to recognize human emotions. Experimental results and simulations show that using visual features alone yields an average accuracy of 74.15 %, while using audio features alone gives an average recognition accuracy of 67.39 %. Combining audio and visual features improves the overall system accuracy significantly, to 80.27 %.
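
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`pca_basis_for_variance`, `codebook_histogram`, `sum_rule_fusion`) are hypothetical, the codebook is assumed to be given already (the abstract does not say how it is learned), the SVM stage is omitted, and the sum rule is assumed here to mean averaging the per-class posterior estimates of the audio and visual classifiers.

```python
import numpy as np


def pca_basis_for_variance(X, target=0.97):
    """Return (mean, basis) with the fewest principal axes whose
    cumulative explained variance reaches `target` (assumed 0.97)."""
    mean = X.mean(axis=0)
    # Singular values of the centered data give the per-axis variance.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, target)) + 1, len(s))
    return mean, vt[:k]


def codebook_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (Euclidean distance)
    and return the normalized histogram of codeword occurrences."""
    diff = descriptors[:, None, :] - codebook[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()


def sum_rule_fusion(posteriors):
    """Average the per-class posteriors of several classifiers and
    return (fused scores, index of the winning class)."""
    fused = np.mean(np.asarray(posteriors, dtype=float), axis=0)
    return fused, int(np.argmax(fused))


# Toy data, purely illustrative (not taken from the paper).
t = np.arange(10.0)
X = np.column_stack([t, 0.01 * np.cos(t)])      # almost all variance on axis 0
mean, basis = pca_basis_for_variance(X)

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])  # hypothetical 2-word codebook
descriptors = np.array([[0.1, 0.0], [9.0, 9.0], [10.0, 11.0], [11.0, 10.0]])
hist = codebook_histogram(descriptors, codebook)

audio_post = [0.6, 0.3, 0.1]                     # hypothetical classifier outputs
visual_post = [0.2, 0.7, 0.1]
fused, label = sum_rule_fusion([audio_post, visual_post])
```

In a full system the histograms would be fed to one classifier per modality, and the posteriors passed to `sum_rule_fusion` would come from those classifiers rather than from hand-written lists.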