Multimodal video concept detection via bag of auditory words and multiple kernel learning

  • Authors:
  • Markus Mühling; Ralph Ewerth; Jun Zhou; Bernd Freisleben

  • Affiliations:
  • Department of Mathematics & Computer Science, University of Marburg, Marburg, Germany (all authors)

  • Venue:
  • MMM'12 Proceedings of the 18th international conference on Advances in Multimedia Modeling
  • Year:
  • 2012


Abstract

State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: the system using BoAW features and a support vector machine with a χ²-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.
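The core of the BoAW pipeline described above can be illustrated with a minimal sketch: MFCC frames are quantized against a codebook of "auditory words" to form a normalized histogram, which is then compared with a χ²-kernel; for MKL-style late fusion, per-modality kernels are combined as a weighted sum. All array shapes, the codebook (a fixed random matrix standing in for a clustered vocabulary), and the kernel bandwidth choice are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def bag_of_auditory_words(mfcc_frames, codebook):
    """Assign each MFCC frame to its nearest codeword (auditory word)
    and return the L1-normalized histogram of assignments (the BoAW feature)."""
    # pairwise squared distances: shape (n_frames, n_words)
    d = ((mfcc_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def chi2_kernel(x, y, eps=1e-10):
    """One common chi-squared kernel: k(x, y) = exp(-sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    return np.exp(-np.sum((x - y) ** 2 / (x + y + eps)))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 13))    # 8 auditory words, 13-dim MFCCs (toy sizes)
frames_a = rng.normal(size=(200, 13))  # MFCC frames of two hypothetical shots
frames_b = rng.normal(size=(150, 13))

h_a = bag_of_auditory_words(frames_a, codebook)
h_b = bag_of_auditory_words(frames_b, codebook)
k_audio = chi2_kernel(h_a, h_b)

# MKL-style fusion: a convex combination of per-modality kernel values,
# with weights beta >= 0, sum(beta) = 1 (here fixed, not learned).
k_visual = 0.7  # stand-in for a visual bag-of-words kernel value
beta = np.array([0.4, 0.6])
k_combined = beta @ np.array([k_audio, k_visual])
```

In practice the codebook would be learned by clustering (e.g., k-means) over MFCC frames of training videos, and the kernel weights β would be optimized jointly with the SVM by the MKL solver rather than fixed by hand.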