Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2.
Multi-modality web video categorization. Proceedings of the International Workshop on Multimedia Information Retrieval.
Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding.
Speech Processing for Audio Indexing. GoTAL '08: Proceedings of the 6th International Conference on Advances in Natural Language Processing.
TubeFiler: an automatic web video categorizer. MM '09: Proceedings of the 17th ACM International Conference on Multimedia.
CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval. ICVS '08: Proceedings of the 6th International Conference on Computer Vision Systems.
Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Content-based video genre classification using multiple cues. Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production.
Automatic tagging and geotagging in video collections and communities. Proceedings of the 1st ACM International Conference on Multimedia Retrieval.
SBNMA '11: Proceedings of the 2011 ACM Workshop on Social and Behavioural Networked Media Access.
Content-based video description for automatic video genre categorization. MMM '12: Proceedings of the 18th International Conference on Advances in Multimedia Modeling.
Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews.
Intent and its discontents: the user at the wheel of the online video search engine. Proceedings of the 20th ACM International Conference on Multimedia.
This paper describes the possibilities of cross-modal classification of multimedia documents on social media platforms. Our framework predicts the user-chosen category of consumer-produced video sequences based on their textual and visual features. The text resources, which include metadata and automatic speech recognition transcripts, are represented as bags of words, and the video content is represented as a bag of clustered local visual features. We investigate the contribution of the different modalities and how they should be combined when sequences lack certain resources. To this end, several classification methods are evaluated with varying resources. The best approach achieves a mean average precision of 0.3977 using user-contributed metadata in combination with clustered SURF.
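A minimal sketch of the kind of pipeline the abstract describes: metadata as a bag of words, local visual descriptors quantized into a clustered visual vocabulary, and the two representations fused before classification. The toy metadata, labels, and random descriptors here are illustrative stand-ins (the paper uses real SURF descriptors and user-contributed metadata), and the early-fusion-by-concatenation step is one plausible combination strategy, not necessarily the authors' exact method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy metadata for four videos (stand-ins for user-contributed tags/titles).
texts = ["cat pet funny", "dog pet park", "soccer goal sports", "tennis match sports"]
labels = [0, 0, 1, 1]  # 0 = animals, 1 = sports (hypothetical categories)

# Text modality: bag of words over the metadata.
text_bow = CountVectorizer().fit_transform(texts).toarray()

# Visual modality: each video yields a set of local descriptors.
# Random 8-dim vectors stand in for SURF descriptors here.
descriptors = [rng.normal(size=(30, 8)) for _ in texts]

# Build a visual vocabulary by clustering all descriptors pooled together.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(np.vstack(descriptors))

# Represent each video as a histogram over the visual words.
visual_bow = np.array(
    [np.bincount(kmeans.predict(d), minlength=5) for d in descriptors]
)

# Early fusion: concatenate both bag representations, then classify.
features = np.hstack([text_bow, visual_bow])
clf = LinearSVC().fit(features, labels)
preds = clf.predict(features)  # training-set predictions
```

When one modality is missing (e.g. no speech transcript), the corresponding sub-vector can be zeroed or the classifier retrained on the remaining features, which is the kind of resource-dependent combination the paper evaluates.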