Reordering video shots for event classification using bag-of-words models and string kernels

Authors:
Yung-Lun Chen;Shyi-Chyi Cheng;Yi-Ping Phoebe Chen
Affiliations:
National Taiwan Ocean Univ., Taiwan;National Taiwan Ocean Univ., Taiwan;La Trobe University, Australia
Venue:
Proceedings of the 27th Conference on Image and Vision Computing New Zealand
Year:
2012

Abstract

This paper presents a novel approach to reordering video shots, built on the state-of-the-art bag-of-words (BoW) model. Shot reordering removes the temporal ambiguity that is likely to degrade conventional video event recognition algorithms based on support vector machine (SVM) classifiers with string kernels. A traditional BoW model builds a feature vector for each video frame by reducing the arrangement of visual words in the 2D image space to a histogram of visual words, which carries no spatial-temporal information. Our approach first segments the input video clip into a set of video shots, each of which is further divided into multiple three-dimensional (3D) video patches and cubes. We introduce spatial-temporal information into the BoW model by analytically extracting space-time features from the individual 3D cubes, and the system learns the BoW codebook from these cubes. Every video shot in an input video sequence is represented as a BoW histogram, and the corresponding event is modelled as a sequence of BoW histograms, which is then reordered by the proposed normalization scheme. SVM classifiers with string kernels are trained on a set of training samples and then used to recognize the event type of a test video clip. The framework offers a simple and effective way to incorporate both the temporal and spatial configurations of video events. Results show that the proposed method performs well on several publicly available datasets in terms of robustness and recognition rate.
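
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the space-time descriptors of the 3D cubes and the codebook are already available as plain tuples of floats, and every function name is hypothetical. Because the abstract does not define the string kernel, a normalised p-spectrum kernel stands in for it. Labelling each shot by its dominant visual word is likewise an assumption made only to obtain a symbol sequence. The proposed reordering scheme is not specified in the abstract and is omitted.

```python
from collections import Counter
from math import sqrt


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bow_histogram(cube_descriptors, codebook):
    """Build an L1-normalised visual-word histogram for one shot."""
    counts = [0] * len(codebook)
    for d in cube_descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def shot_symbol(histogram):
    """Assumption: label a shot by the index of its dominant visual word."""
    return max(range(len(histogram)), key=lambda k: histogram[k])


def spectrum_kernel(s, t, p=2):
    """Stand-in string kernel: cosine-normalised count of shared
    contiguous length-p subsequences of two symbol sequences."""
    def grams(seq):
        return Counter(tuple(seq[i:i + p]) for i in range(len(seq) - p + 1))
    gs, gt = grams(s), grams(t)
    dot = sum(gs[g] * gt[g] for g in gs)
    norm = sqrt(sum(v * v for v in gs.values()) * sum(v * v for v in gt.values()))
    return dot / norm if norm else 0.0


# Toy data: a 3-word codebook and one shot containing three cubes.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
shot = [(0.1, 0.0), (0.9, 1.1), (1.2, 0.8)]
hist = bow_histogram(shot, codebook)
symbol = shot_symbol(hist)

# Two events, each already reduced to a sequence of shot symbols.
similarity = spectrum_kernel([0, 1, 2], [0, 1, 1])
```

A kernel matrix built from such pairwise values could then be supplied to an SVM that accepts precomputed kernels. The shared subsequences depend on the order of the shots, which shows why that order matters for a string kernel.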