Automatic event detection in large collections of unconstrained videos is a challenging and important task. The key issue is describing long, complex videos with high-level semantic descriptors that capture the regularity of events within a category while distinguishing them from events of other categories. This paper proposes a novel unsupervised approach that discovers data-driven concepts from multi-modality signals (audio, scene, and motion) to describe the high-level semantics of videos. Our method consists of three main components: first, we learn low-level features separately for the three modalities; second, we discover data-driven concepts from the statistics of the learned features, mapped to a low-dimensional space using deep belief nets (DBNs); finally, a compact and robust sparse representation is learned to jointly model the concepts from all three modalities. Extensive experiments on a large in-the-wild dataset show that the proposed method significantly outperforms state-of-the-art methods.
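As a rough illustration only (not the authors' implementation), the three-stage pipeline described above might be sketched with scikit-learn components: random arrays stand in for the per-modality low-level features, a two-layer stack of RBMs stands in for the DBN mapping, and `DictionaryLearning` provides the joint sparse code. All dimensions, parameters, and names below are assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.decomposition import DictionaryLearning
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Stand-ins for low-level features from the three modalities
# (audio, scene, motion); in the paper these come from separate
# per-modality feature learners.
n_clips = 40
audio = rng.rand(n_clips, 32)
scene = rng.rand(n_clips, 32)
motion = rng.rand(n_clips, 32)

def dbn_concepts(features, n_hidden=16):
    """Map features to a low-dimensional 'concept' space with a
    two-layer stack of RBMs (a crude stand-in for a DBN)."""
    dbn = Pipeline([
        ("rbm1", BernoulliRBM(n_components=24, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=n_hidden, random_state=0)),
    ])
    return dbn.fit_transform(features)

# Discover per-modality concepts, then concatenate them.
concepts = np.hstack([dbn_concepts(m) for m in (audio, scene, motion)])

# Learn a joint sparse representation over all three modalities.
coder = DictionaryLearning(n_components=20,
                           transform_algorithm="lasso_lars",
                           transform_alpha=0.1,
                           max_iter=10, random_state=0)
codes = coder.fit_transform(concepts)
print(codes.shape)  # (40, 20)
```

The sparse codes would then serve as the final video representation fed to an event classifier (e.g., an SVM).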