Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues

Authors:
Esra Acar;Frank Hopfgartner;Sahin Albayrak
Affiliations:
DAI Laboratory, Technische Universität Berlin, Berlin, Germany;DAI Laboratory, Technische Universität Berlin, Berlin, Germany;DAI Laboratory, Technische Universität Berlin, Berlin, Germany
Venue:
Proceedings of the 21st ACM international conference on Multimedia
Year:
2013


Abstract

Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against that of low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level in order to further improve the performance of violence detection. We use Mel-Frequency Cepstral Coefficients (MFCCs) as audio features and average motion as visual features. To learn a violence model, we use two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using an SVM-based decision fusion.
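
The SVM-based decision fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic Gaussian stand-ins for per-shot audio and visual feature vectors (no MFCCs or motion are actually computed), the SVMs are linear and trained with plain stochastic gradient descent on the hinge loss, and all names (`train_linear_svm`, `make_shots`, `run_demo`) and parameter values are hypothetical. The idea shown is only the two-stage structure: one two-class SVM per modality, then a second SVM trained on the pair of unimodal decision scores.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=50, seed=0):
    """Fit a linear two-class SVM (labels +1 / -1) by SGD on the hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * (dot(w, X[i]) + b) < 1.0
            # L2 regulariser step, plus a hinge step when the margin is violated.
            w = [wj - lr * lam * wj + (lr * y[i] * xj if violated else 0.0)
                 for wj, xj in zip(w, X[i])]
            if violated:
                b += lr * y[i]
    return w, b

def score(model, x):
    """Signed decision value; >= 0 is read as 'violent' (+1)."""
    w, b = model
    return dot(w, x) + b

def make_shots(n, rng):
    """Synthetic stand-in data (assumption): (audio_vec, visual_vec, label)."""
    shots = []
    for _ in range(n):
        label = rng.choice([1, -1])
        audio = [rng.gauss(1.0 * label, 1.0) for _ in range(2)]
        visual = [rng.gauss(1.0 * label, 1.0)]
        shots.append((audio, visual, label))
    return shots

def accuracy(scores, labels):
    hits = sum(1 for s, y in zip(scores, labels) if (1 if s >= 0 else -1) == y)
    return hits / len(labels)

def run_demo(seed=0):
    rng = random.Random(seed)
    # Separate splits: unimodal training, fusion training, and testing.
    base, fuse, test = (make_shots(400, rng) for _ in range(3))

    # Stage 1: one two-class SVM per modality.
    base_labels = [y for _, _, y in base]
    audio_svm = train_linear_svm([a for a, _, _ in base], base_labels)
    visual_svm = train_linear_svm([v for _, v, _ in base], base_labels)

    def stack(shots):
        # Each shot becomes a 2-D vector of unimodal decision scores.
        return [[score(audio_svm, a), score(visual_svm, v)] for a, v, _ in shots]

    # Stage 2: the fusion SVM is trained on decision scores, not raw features.
    fusion_svm = train_linear_svm(stack(fuse), [y for _, _, y in fuse])

    labels = [y for _, _, y in test]
    stacked = stack(test)
    return {
        "audio": accuracy([s[0] for s in stacked], labels),
        "visual": accuracy([s[1] for s in stacked], labels),
        "fused": accuracy([score(fusion_svm, s) for s in stacked], labels),
    }
```

Calling `run_demo()` returns a dictionary of test accuracies for the audio-only, visual-only, and fused classifiers. The values depend on the synthetic data and the seed, so they say nothing about the results reported in the paper. Training the fusion SVM on a split that the unimodal SVMs have not seen is one common way to avoid over-optimistic fusion scores; whether the authors did this is not stated in the abstract.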