We propose semantic model vectors, an intermediate level semantic representation, as a basis for modeling and detecting complex events in unconstrained real-world videos, such as those from YouTube. The semantic model vectors are extracted using a set of discriminative semantic classifiers, each being an ensemble of SVM models trained from thousands of labeled web images, for a total of 280 generic concepts. Our study reveals that the proposed semantic model vectors representation outperforms—and is complementary to—other low-level visual descriptors for video event modeling. We hence present an end-to-end video event detection system, which combines semantic model vectors with other static or dynamic visual descriptors, extracted at the frame, segment, or full clip level. We perform a comprehensive empirical study on the 2010 TRECVID Multimedia Event Detection task (http://www.nist.gov/itl/iad/mig/med10.cfm), which validates the semantic model vectors representation not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact. We also study early and late feature fusion across the various approaches, leading to a 15% performance boost and an overall system performance of 0.46 mean average precision. In order to promote further research in this direction, we made our semantic model vectors for the TRECVID MED 2010 set publicly available for the community to use (http://www1.cs.columbia.edu/~mmerler/SMV.html).
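The core idea above can be illustrated with a minimal sketch: each video frame is scored by a bank of concept classifiers, and the per-frame scores are pooled into a single clip-level semantic model vector. The pooling function, score ranges, and helper names below are illustrative assumptions, not the paper's exact implementation (which uses ensembles of SVMs over 280 web-image-trained concepts).

```python
import numpy as np

NUM_CONCEPTS = 280  # matches the paper's concept vocabulary size

def semantic_model_vector(frame_scores: np.ndarray) -> np.ndarray:
    """Pool per-frame concept-classifier scores (frames x concepts)
    into one clip-level descriptor. Average pooling is an assumption
    for illustration; other pooling schemes are possible."""
    return frame_scores.mean(axis=0)

def late_fusion(score_a: float, score_b: float, w: float = 0.5) -> float:
    """Toy late fusion: weighted average of two detectors' event
    scores (the paper studies both early and late fusion)."""
    return w * score_a + (1.0 - w) * score_b

# Toy example: 10 frames scored by 280 hypothetical concept classifiers.
rng = np.random.default_rng(0)
frame_scores = rng.random((10, NUM_CONCEPTS))  # stand-in for SVM scores
smv = semantic_model_vector(frame_scores)
assert smv.shape == (NUM_CONCEPTS,)
```

The resulting 280-dimensional vector is far more compact than typical bag-of-words or spatio-temporal descriptors, which is consistent with the abstract's claim that the representation is both the strongest and the most compact individual descriptor.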