In this paper we present a systematic study of the automatic classification of consumer videos into a large set of diverse semantic concept classes, carefully selected through user studies and extensively annotated over 1300+ videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to identify new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multimodal fusion frameworks (ensemble, context fusion, and joint boosting) are also evaluated. Experimental results show that visual and audio models perform best on different sets of concepts. Both contribute significantly to multimodal fusion, by expanding the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models are shown to significantly reduce detection errors compared to single-modality models, yielding a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.
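The ensemble fusion framework mentioned above can be illustrated with a minimal late-fusion sketch: per-concept confidence scores from a visual classifier and an audio classifier are combined by a weighted average, and a concept is detected when the fused score crosses a threshold. All scores, weights, the threshold, and the `fuse_scores` helper below are illustrative assumptions, not values or code from the paper.

```python
# Hedged sketch of ensemble (late) fusion of audio and visual classifiers.
# All numbers and names here are hypothetical, for illustration only.

def fuse_scores(visual, audio, w_visual=0.5):
    """Weighted average of per-concept confidence scores from two modalities."""
    return {c: w_visual * visual[c] + (1.0 - w_visual) * audio[c]
            for c in visual}

# Hypothetical per-concept outputs of independently trained SVM-style models.
visual_scores = {"beach": 0.9, "music": 0.2}
audio_scores = {"beach": 0.4, "music": 0.8}

# Weight the visual modality slightly higher (an assumed setting).
fused = fuse_scores(visual_scores, audio_scores, w_visual=0.6)

# Detection decision at an assumed 0.5 threshold.
detected = [c for c, s in sorted(fused.items()) if s >= 0.5]
```

In this toy example, fusion lets the strong visual evidence for "beach" outweigh the weak audio score, while "music" falls just below the threshold; in practice the fusion weights would be learned on validation data rather than fixed by hand.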