In this paper we present a systematic study of the automatic classification of consumer videos into a large set of diverse semantic concept classes, carefully selected through user studies and extensively annotated over 1300+ videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to identify new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multimodal fusion frameworks (ensemble, context fusion, and joint boosting) are also evaluated. Experimental results show that visual and audio models perform best on different sets of concepts. Both contribute significantly to multimodal fusion, by expanding the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models are shown to significantly reduce detection errors compared to single-modality models, yielding a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.
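The ensemble fusion framework mentioned above can be illustrated with a minimal late-fusion sketch: per-concept confidence scores from a visual classifier and an audio classifier are combined by a weighted average, and a concept is detected when the fused score crosses a threshold. All scores, weights, the threshold, and the `fuse_scores` helper below are illustrative assumptions, not values or code from the paper.

```python
# Hedged sketch of ensemble (late) fusion of audio and visual classifiers.
# All numbers and names here are hypothetical, for illustration only.

def fuse_scores(visual, audio, w_visual=0.5):
    """Weighted average of per-concept confidence scores from two modalities."""
    return {c: w_visual * visual[c] + (1.0 - w_visual) * audio[c]
            for c in visual}

# Hypothetical per-concept outputs of independently trained SVM-style models.
visual_scores = {"beach": 0.9, "music": 0.2}
audio_scores = {"beach": 0.4, "music": 0.8}

# Weight the visual modality slightly higher (an assumed setting).
fused = fuse_scores(visual_scores, audio_scores, w_visual=0.6)

# Detection decision at an assumed 0.5 threshold.
detected = [c for c, s in sorted(fused.items()) if s >= 0.5]
```

In this toy example, fusion lets the strong visual evidence for "beach" outweigh the weak audio score, while "music" falls just below the threshold; in practice the fusion weights would be learned on validation data rather than fixed by hand.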