Discovering joint audio-visual codewords for video event detection
Machine Vision and Applications
Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to construct the bi-modal words, which reveal joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations, which are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone, as well as over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
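The core step described above — partitioning a bipartite graph between visual and audio words to obtain joint clusters — can be sketched with standard bipartite spectral co-clustering (in the style of Dhillon's method). This is a minimal illustration, not the authors' implementation: the co-occurrence matrix `A`, the number of bi-modal words `k`, and the embedding dimension are all assumptions for the sketch, and it presumes every word co-occurs at least once (no zero rows or columns in `A`).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def bimodal_words(A, k, seed=0):
    """Co-cluster visual words (rows of A) and audio words (columns of A)
    into k joint clusters ("bi-modal words") via bipartite spectral
    partitioning of the visual/audio co-occurrence matrix A.

    Hypothetical sketch: assumes A has no all-zero rows or columns.
    """
    A = np.asarray(A, dtype=float)
    # Degree-normalize the bipartite adjacency: An = D1^{-1/2} A D2^{-1/2}
    d1 = A.sum(axis=1)
    d2 = A.sum(axis=0)
    An = A / np.sqrt(np.outer(d1, d2))
    # Singular vectors of An embed both vocabularies in a shared space;
    # the leading pair is trivial, so we keep vectors 2..l.
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = int(np.ceil(np.log2(k))) + 1
    Z = np.vstack([U[:, 1:l], Vt.T[:, 1:l]])
    # Cluster visual and audio words together; each cluster is one
    # bi-modal word spanning both modalities.
    _, labels = kmeans2(Z, k, minit='++', seed=seed)
    visual_labels = labels[:A.shape[0]]
    audio_labels = labels[A.shape[0]:]
    return visual_labels, audio_labels
```

Once the joint clusters are found, each video's visual and audio word counts can be re-quantized by mapping every unimodal word to its bi-modal cluster and pooling (e.g. averaging) the counts within each cluster, yielding the bi-modal Bag-of-Words vector used by the event classifier.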