Joint audio-visual bi-modal codewords for video event detection

  • Authors:
  • Guangnan Ye; I-Hong Jhuo; Dong Liu; Yu-Gang Jiang; D. T. Lee; Shih-Fu Chang

  • Affiliations:
  • Columbia University; National Taiwan University; Columbia University; Fudan University; National Taiwan University and National Chung Hsing University; Columbia University

  • Venue:
  • Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
  • Year:
  • 2012


Abstract

Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to construct the bi-modal words that reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and to form bi-modal Bag-of-Words representations, which are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual or audio features alone, as well as over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
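
The abstract outlines a three-step pipeline: build a bipartite graph between visual and audio codewords, partition it into bi-modal words, and pool uni-modal histogram responses into a bi-modal Bag-of-Words. The sketch below illustrates one plausible realization of that pipeline; it is not the authors' implementation. It assumes the bipartite edge weights come from per-video co-occurrence of codeword histograms, that the graph is partitioned with spectral co-clustering (scikit-learn's SpectralCoclustering), and that "average pooling" means taking the mean response of a bi-modal word's member uni-modal words.

```python
# Hypothetical sketch of a bi-modal codeword pipeline; details such as the
# co-occurrence weighting, the partitioning algorithm, and the pooling rule
# are assumptions, not taken from the paper.
import numpy as np
from sklearn.cluster import SpectralCoclustering


def build_cooccurrence(visual_bow, audio_bow):
    """Bipartite edge weights between visual and audio words.

    visual_bow: (n_videos, n_visual_words) histograms
    audio_bow:  (n_videos, n_audio_words) histograms
    Returns an (n_visual_words, n_audio_words) co-occurrence matrix.
    """
    return visual_bow.T @ audio_bow


def bimodal_words(visual_bow, audio_bow, n_bimodal=100, seed=0):
    """Partition the bipartite graph into joint audio-visual codewords."""
    C = build_cooccurrence(visual_bow, audio_bow) + 1e-8  # avoid all-zero rows/cols
    model = SpectralCoclustering(n_clusters=n_bimodal, random_state=seed)
    model.fit(C)
    # One cluster id per visual word and per audio word.
    return model.row_labels_, model.column_labels_


def bimodal_bow(visual_bow, audio_bow, vis_labels, aud_labels,
                n_bimodal=100, pooling="average"):
    """Re-quantize uni-modal histograms into a bi-modal Bag-of-Words."""
    n_videos = visual_bow.shape[0]
    feat = np.zeros((n_videos, n_bimodal))
    for k in range(n_bimodal):
        # Columns (uni-modal words) assigned to bi-modal word k.
        members = np.hstack([visual_bow[:, vis_labels == k],
                             audio_bow[:, aud_labels == k]])
        if members.shape[1] == 0:
            continue
        if pooling == "average":
            feat[:, k] = members.mean(axis=1)
        else:  # max pooling as an alternative strategy
            feat[:, k] = members.max(axis=1)
    return feat
```

In this reading, the resulting `feat` matrix plays the role of the bi-modal Bag-of-Words representation passed to the downstream event classifiers, and switching `pooling` between "average" and "max" corresponds to the pooling strategies the abstract says were compared.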