Multimedia event detection with multimodal feature fusion and temporal concept localization

Authors:
Sangmin Oh;Scott McCloskey;Ilseo Kim;Arash Vahdat;Kevin J. Cannons;Hossein Hajimirsadeghi;Greg Mori;A. G. Perera;Megha Pandey;Jason J. Corso
Affiliations:
Kitware Inc., Clifton Park, USA;Honeywell Labs, Minneapolis, USA;Kitware Inc., Clifton Park, USA;School of Computing Science, Simon Fraser University, Burnaby, Canada;School of Computing Science, Simon Fraser University, Burnaby, Canada;School of Computing Science, Simon Fraser University, Burnaby, Canada;School of Computing Science, Simon Fraser University, Burnaby, Canada;Kitware Inc., Clifton Park, USA;Kitware Inc., Clifton Park, USA;Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, USA
Venue:
Machine Vision and Applications
Year:
2014

Special issue on Multimedia Event Detection

Machine Vision and Applications

Abstract

We present a system for multimedia event detection. The system characterizes complex multimedia events using a large array of multimodal features and classifies unseen videos by fusing the diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, building mid-level and high-level features on top of low-level features, often in an unsupervised manner, to enable semantic understanding. Second, we present a novel Latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, the model produces a unique summary for every retrieved video through its use of high-level concepts and temporal evidence localization. This summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms, together with our methodology for improving fusion learning when training data is limited. A thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
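
The two ideas the abstract names, temporal evidence localization and score fusion, can be illustrated with a minimal sketch. This is not the authors' Latent SVM or their fusion learning algorithm: it only shows the inference-time step of picking the best-responding segment per concept (the segment index acts as the latent variable) and a simple weighted average of z-normalized per-modality scores. The function names, array shapes, and fixed weights are all assumptions made for illustration.

```python
import numpy as np

def localize_concepts(segment_features, concept_weights):
    """Return, per concept, the best segment score and the segment index.

    segment_features: (num_segments, dim) array, one row per video segment.
    concept_weights:  (num_concepts, dim) array of linear concept models.
    Both are hypothetical inputs; the paper's features and models differ.
    """
    # responses[t, k] is the linear response of concept k on segment t.
    responses = segment_features @ concept_weights.T
    best_segment = responses.argmax(axis=0)  # where the evidence is located
    best_score = responses.max(axis=0)       # how strong that evidence is
    return best_score, best_segment

def late_fusion(scores_by_modality, weights):
    """Weighted average of z-normalized detector scores (assumed scheme)."""
    num_videos = len(next(iter(scores_by_modality.values())))
    fused = np.zeros(num_videos)
    for name, scores in scores_by_modality.items():
        s = np.asarray(scores, dtype=float)
        std = s.std()
        # Guard against a constant score vector, which has zero spread.
        z = (s - s.mean()) / std if std > 0 else np.zeros_like(s)
        fused += weights[name] * z
    return fused / sum(weights[name] for name in scores_by_modality)
```

The segment indices returned by `localize_concepts` are what would let a system point to the part of a video that triggered a concept, which is the kind of transparency the abstract describes. In practice the fusion weights would be learned rather than fixed by hand.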