Discovering joint audio-visual codewords for video event detection
Machine Vision and Applications
Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to construct the bi-modal words, which reveal joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations, which are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone, as well as over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
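The core step described above — partitioning a bipartite graph between visual and audio words to obtain joint clusters — can be sketched with standard bipartite spectral co-clustering (in the style of Dhillon's method). This is a minimal illustration, not the authors' implementation: the co-occurrence matrix `A`, the number of bi-modal words `k`, and the embedding dimension are all assumptions for the sketch, and it presumes every word co-occurs at least once (no zero rows or columns in `A`).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def bimodal_words(A, k, seed=0):
    """Co-cluster visual words (rows of A) and audio words (columns of A)
    into k joint clusters ("bi-modal words") via bipartite spectral
    partitioning of the visual/audio co-occurrence matrix A.

    Hypothetical sketch: assumes A has no all-zero rows or columns.
    """
    A = np.asarray(A, dtype=float)
    # Degree-normalize the bipartite adjacency: An = D1^{-1/2} A D2^{-1/2}
    d1 = A.sum(axis=1)
    d2 = A.sum(axis=0)
    An = A / np.sqrt(np.outer(d1, d2))
    # Singular vectors of An embed both vocabularies in a shared space;
    # the leading pair is trivial, so we keep vectors 2..l.
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = int(np.ceil(np.log2(k))) + 1
    Z = np.vstack([U[:, 1:l], Vt.T[:, 1:l]])
    # Cluster visual and audio words together; each cluster is one
    # bi-modal word spanning both modalities.
    _, labels = kmeans2(Z, k, minit='++', seed=seed)
    visual_labels = labels[:A.shape[0]]
    audio_labels = labels[A.shape[0]:]
    return visual_labels, audio_labels
```

Once the joint clusters are found, each video's visual and audio word counts can be re-quantized by mapping every unimodal word to its bi-modal cluster and pooling (e.g. averaging) the counts within each cluster, yielding the bi-modal Bag-of-Words vector used by the event classifier.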