SUPER: towards real-time event recognition in internet videos

Authors:
Yu-Gang Jiang
Affiliations:
Fudan University, Shanghai, China
Venue:
Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Year:
2012

Abstract

Event recognition in unconstrained Internet videos has great potential in many applications. State-of-the-art systems usually include computationally expensive modules, such as the extraction of spatial-temporal interest points, which makes large-scale video processing a major challenge. This paper presents SUPER, a Speeded UP Event Recognition framework for efficient Internet video analysis. We take a multimodal baseline that has produced strong performance on popular benchmarks and systematically evaluate each component in terms of both computational cost and contribution to recognition accuracy. We show that, by choosing suitable features, classifiers, and fusion strategies, recognition speed can be greatly improved with minor performance degradation. We also evaluate how many visual and audio frames are needed for event recognition in Internet videos, a question left unanswered in the literature. Results on a rigorously designed dataset indicate that similar recognition accuracy can be attained using only 14 frames per video on average. We also observe that, unlike the visual channel, the soundtrack contains little redundant information for video event recognition. Integrating all the findings, our suggested SUPER framework is 220-fold faster than the baseline approach with only a 3.8% drop in recognition accuracy. It classifies an 80-second video sequence using models of 20 classes in just 4.56 seconds.
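
To put the timing figures in the abstract into perspective, the short sketch below works through the arithmetic. The function and constant names are illustrative only and do not come from the paper. The baseline time it prints is an extrapolation: it assumes the reported 220-fold speedup applies uniformly to this particular 80-second video, which the abstract does not state.

```python
def real_time_factor(video_seconds: float, processing_seconds: float) -> float:
    """Seconds of video handled per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return video_seconds / processing_seconds


def implied_baseline_seconds(processing_seconds: float, speedup: float) -> float:
    """Baseline time implied by a speedup factor (assumes the factor
    holds for this specific video, which is an extrapolation)."""
    return processing_seconds * speedup


# Figures quoted in the abstract.
VIDEO_SECONDS = 80.0
SUPER_SECONDS = 4.56
SPEEDUP = 220.0

rtf = real_time_factor(VIDEO_SECONDS, SUPER_SECONDS)
baseline = implied_baseline_seconds(SUPER_SECONDS, SPEEDUP)

print(f"SUPER runs about {rtf:.1f}x faster than real time")
print(f"Implied baseline time: about {baseline:.0f} s ({baseline / 60:.1f} min)")
```

Under these assumptions, SUPER processes the clip roughly 17.5 times faster than its playback duration, while the baseline would need on the order of 17 minutes for the same clip.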