Multimodal feature fusion for robust event detection in web videos

Authors:
Premkumar Natarajan
Affiliations:
Speech, Language and Multimedia Business Unit, Raytheon BBN Technologies, Cambridge, MA 02138
Venue:
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Year:
2012



Abstract

Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large-scale analysis of web videos. In our work, we rigorously analyze and combine a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. We also evaluate the utility of high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Further, we exploit multimodal information by analyzing available spoken and videotext content using state-of-the-art automatic speech recognition (ASR) and videotext recognition systems. We combine these diverse features using a two-step strategy that employs multiple kernel learning (MKL) followed by late score-level fusion. In the TRECVID MED 2011 evaluations for detecting 10 events in a large benchmark set of ∼45,000 videos, our system showed the best performance among the 19 international teams.
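
The two-step strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify kernels, weights, or score normalization, so everything below is an assumption. In particular, the kernel weights are fixed by hand as a stand-in for weights that MKL would learn jointly with the classifier, z-score normalization is one common choice for late fusion, and the function names (`rbf_kernel`, `combine_kernels`, `late_fusion`) and toy data are hypothetical.

```python
import math

def rbf_kernel(X, Y, gamma):
    """Gram matrix of an RBF kernel between the row vectors of X and Y."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in Y] for x in X]

def combine_kernels(kernels, weights):
    """Step 1 (early fusion): weighted combination of per-feature Gram matrices.
    Assumption: weights are given here; in MKL they would be learned."""
    total = sum(weights)
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) / total
             for j in range(cols)] for i in range(rows)]

def zscore(scores):
    """Normalize one system's scores to zero mean and unit variance."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant scores
    return [(s - mean) / std for s in scores]

def late_fusion(score_lists, weights):
    """Step 2 (late fusion): weighted average of normalized per-system scores."""
    normed = [zscore(s) for s in score_lists]
    total = sum(weights)
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, normed)) / total
            for i in range(n)]

# Toy data (hypothetical): three videos, two low-level feature types.
appearance = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
audio = [[0.2], [0.8], [0.7]]
K = combine_kernels(
    [rbf_kernel(appearance, appearance, 1.0), rbf_kernel(audio, audio, 1.0)],
    [0.5, 0.5],
)

# Hypothetical detection scores from two subsystems for the same three videos,
# e.g. a classifier on the fused kernel and an ASR-based detector.
fused = late_fusion([[0.1, 0.9, 0.8], [0.3, 0.4, 0.9]], [0.7, 0.3])
```

In this sketch the combined matrix `K` would be passed to a kernel classifier such as an SVM with a precomputed kernel, and `fused` holds one final detection score per video.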