How to fuse static and dynamic information is a key issue in event analysis. In this paper, we present a novel approach that combines appearance and motion information in a top-down manner for event recognition in real videos. Unlike the conventional bottom-up way, attention can be focused volitionally on top-down signals derived from task demands. A video is represented by a collection of spatio-temporal features, called video words, obtained by quantizing the spatio-temporal interest points (STIPs) extracted from the video. We propose two approaches to building class-specific visual or motion histograms over these features. The first uses the probability of a class given a visual or motion word: a high probability means more attention should be paid to that word. The second, in order to also incorporate the negative information carried by each word, uses the mutual information between each word and the event label: high mutual information means high relevance between the word and the class label. Both methods not only characterize two aspects of an event, but also select the relevant words, which are discriminative for the corresponding event. Experimental results on the TRECVID 2005 and HOHA video corpora demonstrate that the proposed method improves the mean average precision.
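The two weighting schemes above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' exact formulation: the function name is hypothetical, and it simplifies word usage to binary occurrence per video when estimating P(c | w) and the mutual information I(w; c).

```python
import numpy as np

def word_class_relevance(counts, labels, n_classes):
    """Score each video word's relevance to each event class.

    counts: (n_videos, n_words) bag-of-words histograms
    labels: (n_videos,) integer class labels in [0, n_classes)
    Returns:
      p_class_given_word: (n_classes, n_words) estimate of P(c | w=1)
      mi: (n_words,) mutual information I(w; c) over binary occurrence
    """
    eps = 1e-12                                # guard against log(0)/division by 0
    occ = (counts > 0).astype(float)           # binary word occurrence per video
    n = len(labels)
    p_w = occ.mean(axis=0)                     # P(w=1)
    p_class_given_word = np.zeros((n_classes, occ.shape[1]))
    mi = np.zeros(occ.shape[1])
    for c in range(n_classes):
        mask = labels == c
        p_c = mask.mean()                      # P(c)
        p_joint1 = occ[mask].sum(axis=0) / n   # P(w=1, c)
        p_joint0 = p_c - p_joint1              # P(w=0, c)
        p_class_given_word[c] = p_joint1 / (p_w + eps)
        # I(w; c) = sum over (w, c) of P(w, c) * log[P(w, c) / (P(w) P(c))];
        # the w=0 terms carry the "negative information" of a word's absence.
        mi += p_joint1 * np.log(p_joint1 / (p_w * p_c + eps) + eps)
        mi += p_joint0 * np.log(p_joint0 / ((1 - p_w) * p_c + eps) + eps)
    return p_class_given_word, mi
```

Words scoring high under either measure would then be kept (or up-weighted) when building the class-specific visual and motion histograms, while uninformative words are suppressed.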