We investigate joint audio-visual analysis of generic videos for semantic concept detection. We propose a novel representation, the Short-term Audio-Visual Atom (S-AVA), to improve concept detection. An S-AVA is a short-term region track associated with regional visual features and background audio features. We develop an algorithm, Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We evaluate our algorithm extensively on Kodak's consumer benchmark video set, collected from real users. Experimental results show significant performance improvements: a gain of over 120% in mean average precision (MAP) compared with alternative approaches that use static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% in average precision (AP) over 21 concepts, with many concepts improving by more than 20%.
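For readers unfamiliar with the reported metrics, the following is a minimal sketch of how AP, MAP, and a relative gain could be computed. It assumes the common non-interpolated definition of average precision and reads the quoted gains as relative improvements; the abstract specifies neither, and the function names and toy numbers are hypothetical, not taken from the paper.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP (assumed variant): mean of the precision values
    measured at each rank where a positive item is retrieved."""
    # Rank items from highest to lowest detector score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_concept_ap: Sequence[float]) -> float:
    """MAP: arithmetic mean of the per-concept AP values."""
    return sum(per_concept_ap) / len(per_concept_ap)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline` (1.2 means +120%)."""
    return (new - baseline) / baseline


# Toy data (hypothetical): four videos scored for a single concept.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_map = mean_average_precision([toy_ap, 0.5])
toy_gain = relative_gain(0.22, 0.10)
```

Under this reading, a MAP of 0.22 against a baseline of 0.10 would correspond to a 120% gain; these values are illustrative only.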