Detection bank: an object detection based video representation for multimedia event recognition

Authors:
Tim Althoff;Hyun Oh Song;Trevor Darrell
Affiliations:
UC Berkeley / ICSI, Berkeley, CA, USA;UC Berkeley / ICSI, Berkeley, CA, USA;UC Berkeley / ICSI, Berkeley, CA, USA
Venue:
Proceedings of the 20th ACM international conference on Multimedia
Year:
2012

Citing (6):
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
Large Scale Multiple Kernel Learning. The Journal of Machine Learning Research
Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence
Efficient object category recognition using classemes. ECCV'10 Proceedings of the 11th European conference on Computer vision: Part I
Double fusion for multimedia event detection. MMM'12 Proceedings of the 18th international conference on Advances in Multimedia Modeling
Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia

Cited by (3):
Recommendations for video event recognition using concept vocabularies. Proceedings of the 3rd ACM conference on International conference on multimedia retrieval
Searching informative concept banks for video event detection. Proceedings of the 3rd ACM conference on International conference on multimedia retrieval
Evaluating multimedia features and fusion for example-based event detection. Machine Vision and Applications


Abstract

While low-level image features have proven to be effective representations for visual recognition tasks such as object recognition and scene classification, they are inadequate for capturing the complex semantic meaning required to solve high-level visual tasks such as multimedia event detection and recognition. Recognition or retrieval of events and activities can be improved if specific discriminative objects are detected in a video sequence. In this paper, we propose an image representation, called Detection Bank, based on the detection images from a large number of windowed object detectors, where an image is represented by different statistics derived from these detections. This representation is extended to video by aggregating the key-frame-level image representations through mean and max pooling. We empirically show that it captures information complementary to state-of-the-art representations such as Spatial Pyramid Matching and Object Bank. These descriptors, combined with our Detection Bank representation, significantly outperform any of the representations alone on TRECVID MED 2011 data.
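
The pooling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which per-detector statistics are used, so the three chosen here (maximum score, mean score, detection count), the zero fill for detectors that fire nowhere in a frame, and the concatenation of the mean-pooled and max-pooled vectors are all assumptions made for illustration. The function names `frame_descriptor` and `video_descriptor` are hypothetical.

```python
from typing import List, Sequence


def frame_descriptor(detections: Sequence[Sequence[float]]) -> List[float]:
    """Describe one key frame by simple statistics of its detections.

    `detections[d]` holds the scores of all windows returned by detector d.
    The statistics (max, mean, count) are assumed for this sketch; the
    abstract only says "different statistics derived from these detections".
    """
    features: List[float] = []
    for scores in detections:
        if scores:
            features.extend(
                [max(scores), sum(scores) / len(scores), float(len(scores))]
            )
        else:
            # Assumption: a detector with no detections contributes zeros.
            features.extend([0.0, 0.0, 0.0])
    return features


def video_descriptor(frames: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Aggregate key-frame descriptors into one video-level vector.

    Mean and max pooling are applied per dimension across key frames.
    Concatenating the two pooled vectors is an assumption of this sketch.
    """
    if not frames:
        raise ValueError("at least one key frame is required")
    per_frame = [frame_descriptor(frame) for frame in frames]
    columns = list(zip(*per_frame))  # one tuple per feature dimension
    mean_pooled = [sum(col) / len(col) for col in columns]
    max_pooled = [max(col) for col in columns]
    return mean_pooled + max_pooled


if __name__ == "__main__":
    # Two key frames, two hypothetical detectors.
    toy_video = [
        [[0.5, 1.0], []],   # frame 1: detector 0 fires twice, detector 1 never
        [[0.25], [2.0]],    # frame 2: each detector fires once
    ]
    print(video_descriptor(toy_video))
```

With D detectors and three statistics each, this sketch yields a video-level vector of length 2 × 3 × D, independent of the number of key frames, which is what allows videos of different lengths to be compared by a standard classifier.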