A Tutorial on Support Vector Machines for Pattern Recognition
Data Mining and Knowledge Discovery
A Generalized Representer Theorem
COLT '01/EuroCOLT '01 Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory
ICCV '03 Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2
Recognizing Human Actions: A Local SVM Approach
ICPR '04 Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04) - Volume 3
A Bayesian Hierarchical Model for Learning Natural Scene Categories
CVPR '05 Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
Evaluation campaigns and TRECVid
MIR '06 Proceedings of the 8th ACM international workshop on Multimedia information retrieval
A dual coordinate descent method for large-scale linear SVM
Proceedings of the 25th international conference on Machine learning
An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector
ECCV '08 Proceedings of the 10th European Conference on Computer Vision: Part II
The Pascal Visual Object Classes (VOC) Challenge
International Journal of Computer Vision
Improving the Fisher kernel for large-scale image classification
ECCV'10 Proceedings of the 11th European conference on Computer vision: Part IV
Image classification using super-vector coding of local image descriptors
ECCV'10 Proceedings of the 11th European conference on Computer vision: Part V
We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features, applied to the seven event classes defined by the SED task. The approach is based on local spatio-temporal descriptors called MoSIFT, which are computed from pairs of video frames. A Gaussian Mixture Model (GMM) is learned to model the distribution of these low-level features. Then, for each sliding window, Fisher vector (FV) encoding [improvedFV] is used to generate the sample representation. A linear SVM is trained for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low-level visual features with a Gaussian Mixture Model and to generate an intermediate vector representation for the bag of features. FV encoding uses higher-order statistics in place of the histograms of the standard bag-of-words (BoW) model. FV has several good properties: (a) it can naturally separate the video-specific information from the noisy local features, and (b) a linear model can be used with this representation. We built an efficient implementation of FV encoding that runs 10 times faster than real time. We also apply non-trivial object localization techniques to video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed all other teams' submissions in TRECVID SED 2012 on four of the seven event types.
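The encoding step for a single sliding window can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a diagonal-covariance GMM whose parameters have already been fit, and it follows the commonly used improved Fisher vector formulation (gradients with respect to the means and standard deviations, followed by power and L2 normalization). The function name `fisher_vector`, the array shapes, and the random stand-in descriptors are all hypothetical.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Encode N local descriptors as one Fisher vector (illustrative sketch).

    X: (N, D) descriptors from one window.
    weights: (K,) mixture weights; means, variances: (K, D) diagonal GMM.
    Returns a vector of length 2 * K * D.
    """
    N, D = X.shape
    diff = X[:, None, :] - means[None, :, :]                  # (N, K, D)

    # Log-density of each descriptor under each diagonal Gaussian, shape (N, K).
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))

    # Soft assignments (posteriors), computed stably in the log domain.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (N, K)

    # First- and second-order statistics, normalized by the mixture weights.
    z = diff / np.sqrt(variances)                             # (N, K, D)
    g_mu = np.einsum('nk,nkd->kd', gamma, z) / (N * np.sqrt(weights)[:, None])
    g_sigma = (np.einsum('nk,nkd->kd', gamma, z ** 2 - 1)
               / (N * np.sqrt(2 * weights)[:, None]))

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])

    # Power (signed square-root) normalization, then L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

# Usage with random stand-ins for the descriptors and GMM parameters.
rng = np.random.default_rng(0)
K, D = 4, 8
X = rng.normal(size=(200, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
fv = fisher_vector(X, weights, means, variances)
```

Because the resulting vector has fixed length regardless of how many descriptors fall in the window, it can be passed directly to a linear classifier, which is the property the abstract highlights in (b).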