We propose semantic model vectors, an intermediate level semantic representation, as a basis for modeling and detecting complex events in unconstrained real-world videos, such as those from YouTube. The semantic model vectors are extracted using a set of discriminative semantic classifiers, each being an ensemble of SVM models trained from thousands of labeled web images, for a total of 280 generic concepts. Our study reveals that the proposed semantic model vectors representation outperforms—and is complementary to—other low-level visual descriptors for video event modeling. We hence present an end-to-end video event detection system, which combines semantic model vectors with other static or dynamic visual descriptors, extracted at the frame, segment, or full clip level. We perform a comprehensive empirical study on the 2010 TRECVID Multimedia Event Detection task (http://www.nist.gov/itl/iad/mig/med10.cfm), which validates the semantic model vectors representation not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact. We also study early and late feature fusion across the various approaches, leading to a 15% performance boost and an overall system performance of 0.46 mean average precision. In order to promote further research in this direction, we made our semantic model vectors for the TRECVID MED 2010 set publicly available for the community to use (http://www1.cs.columbia.edu/~mmerler/SMV.html).
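The core idea above can be illustrated with a minimal sketch: each video frame is scored by a bank of concept classifiers, and the per-frame scores are pooled into a single clip-level semantic model vector. The pooling function, score ranges, and helper names below are illustrative assumptions, not the paper's exact implementation (which uses ensembles of SVMs over 280 web-image-trained concepts).

```python
import numpy as np

NUM_CONCEPTS = 280  # matches the paper's concept vocabulary size

def semantic_model_vector(frame_scores: np.ndarray) -> np.ndarray:
    """Pool per-frame concept-classifier scores (frames x concepts)
    into one clip-level descriptor. Average pooling is an assumption
    for illustration; other pooling schemes are possible."""
    return frame_scores.mean(axis=0)

def late_fusion(score_a: float, score_b: float, w: float = 0.5) -> float:
    """Toy late fusion: weighted average of two detectors' event
    scores (the paper studies both early and late fusion)."""
    return w * score_a + (1.0 - w) * score_b

# Toy example: 10 frames scored by 280 hypothetical concept classifiers.
rng = np.random.default_rng(0)
frame_scores = rng.random((10, NUM_CONCEPTS))  # stand-in for SVM scores
smv = semantic_model_vector(frame_scores)
assert smv.shape == (NUM_CONCEPTS,)
```

The resulting 280-dimensional vector is far more compact than typical bag-of-words or spatio-temporal descriptors, which is consistent with the abstract's claim that the representation is both the strongest and the most compact individual descriptor.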