Detecting violent scenes in movies is an important capability for video content understanding, for example in automated youth protection services. A key issue in designing violence detection algorithms is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against low-level audio and visual features. To further improve detection performance, we fuse the mid-level audio cues with low-level visual cues at the decision level. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as visual features, and we learn the violence model with two-class support vector machines (SVMs). Our experiments on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and yield more precise results than low-level ones. Detection performance improves further when the mid-level audio cues are fused with low-level visual ones using SVM-based decision fusion.
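To make the idea of SVM-based decision fusion concrete, the following is a minimal sketch rather than the method used in the paper. It assumes synthetic Gaussian vectors as stand-ins for the audio and visual features, a small hand-written linear SVM trained by subgradient descent, and a fusion SVM whose inputs are the two unimodal decision values. All function names, feature dimensions, and hyperparameters are hypothetical; the abstract does not specify the kernel, the fusion inputs, or the training protocol.

```python
import random


def train_linear_svm(X, y, epochs=30, lam=0.01, lr=0.05, seed=0):
    """Linear SVM (hinge loss + L2 penalty) via subgradient descent; labels are +1/-1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move the hyperplane toward this sample.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def score(model, x):
    """Signed decision value of a trained linear SVM for one feature vector."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def make_shots(n, rng):
    """Synthetic shots (audio_vec, visual_vec, label); illustrative data only."""
    shots = []
    for _ in range(n):
        label = rng.choice([1, -1])  # +1 = violent, -1 = non-violent
        audio = [rng.gauss(1.5 * label, 1.0), rng.gauss(1.5 * label, 1.0)]
        visual = [rng.gauss(1.0 * label, 1.0)]  # stand-in for average motion
        shots.append((audio, visual, label))
    return shots


def accuracy(predict, shots):
    return sum(predict(a, v) == y for a, v, y in shots) / len(shots)


def run_demo(seed=0):
    rng = random.Random(seed)
    unimodal_set = make_shots(200, rng)  # trains the per-modality SVMs
    fusion_set = make_shots(200, rng)    # trains the fusion SVM on held-out scores
    test_set = make_shots(200, rng)

    labels = [y for _, _, y in unimodal_set]
    audio_svm = train_linear_svm([a for a, _, _ in unimodal_set], labels)
    visual_svm = train_linear_svm([v for _, v, _ in unimodal_set], labels)

    # Decision-level fusion: the fusion SVM sees only the two unimodal scores.
    fusion_X = [[score(audio_svm, a), score(visual_svm, v)] for a, v, _ in fusion_set]
    fusion_svm = train_linear_svm(fusion_X, [y for _, _, y in fusion_set])

    def audio_only(a, v):
        return 1 if score(audio_svm, a) >= 0 else -1

    def visual_only(a, v):
        return 1 if score(visual_svm, v) >= 0 else -1

    def fused(a, v):
        pair = [score(audio_svm, a), score(visual_svm, v)]
        return 1 if score(fusion_svm, pair) >= 0 else -1

    return {
        "audio": accuracy(audio_only, test_set),
        "visual": accuracy(visual_only, test_set),
        "fused": accuracy(fused, test_set),
    }


if __name__ == "__main__":
    print(run_demo())
```

The fusion SVM is trained on a separate split from the unimodal SVMs so that it learns from decision values the unimodal models produce on unseen shots. The accuracies it prints describe only this synthetic data and say nothing about the results reported in the paper.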