Multimodal video concept detection via bag of auditory words and multiple kernel learning

  • Authors:
  • Markus Mühling; Ralph Ewerth; Jun Zhou; Bernd Freisleben

  • Affiliations:
  • Department of Mathematics & Computer Science, University of Marburg, Marburg, Germany (all authors)

  • Venue:
  • MMM'12 Proceedings of the 18th international conference on Advances in Multimedia Modeling
  • Year:
  • 2012


Abstract

State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: the system using BoAW features and a support vector machine with a χ²-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.
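The core of the BoAW pipeline described above can be illustrated with a minimal sketch: MFCC frames are quantized against a codebook of "auditory words" to form a normalized histogram, which is then compared with a χ²-kernel; for MKL-style late fusion, per-modality kernels are combined as a weighted sum. All array shapes, the codebook (a fixed random matrix standing in for a clustered vocabulary), and the kernel bandwidth choice are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def bag_of_auditory_words(mfcc_frames, codebook):
    """Assign each MFCC frame to its nearest codeword (auditory word)
    and return the L1-normalized histogram of assignments (the BoAW feature)."""
    # pairwise squared distances: shape (n_frames, n_words)
    d = ((mfcc_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def chi2_kernel(x, y, eps=1e-10):
    """One common chi-squared kernel: k(x, y) = exp(-sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    return np.exp(-np.sum((x - y) ** 2 / (x + y + eps)))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 13))    # 8 auditory words, 13-dim MFCCs (toy sizes)
frames_a = rng.normal(size=(200, 13))  # MFCC frames of two hypothetical shots
frames_b = rng.normal(size=(150, 13))

h_a = bag_of_auditory_words(frames_a, codebook)
h_b = bag_of_auditory_words(frames_b, codebook)
k_audio = chi2_kernel(h_a, h_b)

# MKL-style fusion: a convex combination of per-modality kernel values,
# with weights beta >= 0, sum(beta) = 1 (here fixed, not learned).
k_visual = 0.7  # stand-in for a visual bag-of-words kernel value
beta = np.array([0.4, 0.6])
k_combined = beta @ np.array([k_audio, k_visual])
```

In practice the codebook would be learned by clustering (e.g., k-means) over MFCC frames of training videos, and the kernel weights β would be optimized jointly with the SVM by the MKL solver rather than fixed by hand.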