Proceedings of the 21st ACM international conference on Multimedia
Documents containing both video and text are increasingly widespread, yet content analysis of such documents still depends primarily on the text. Automated discovery of semantically related words from text improves free-text query understanding; translating videos into text summaries, however, enables better video search, particularly when accompanying text is absent. In this paper, we propose a multimedia topic modeling framework that provides a basis for automatically discovering and translating semantically related words, obtained from the textual metadata of multimedia documents, into semantically related videos or video frames. The framework jointly models video and text and is flexible enough to handle different feature types in each constituent domain: discrete and real-valued features from videos representing actions, objects, colors, and scenes, as well as discrete features from text. Our proposed models fit the multimedia data much better in terms of held-out log likelihood. For a given query video, the models translate low-level vision features into bag-of-keywords summaries, which simple natural language generation techniques can further turn into human-readable paragraphs. We quantitatively compare the resulting video-to-bag-of-words translations against a state-of-the-art baseline object recognition model from computer vision, and we show that text translations from our multimodal topic models vastly outperform the baseline on a multimedia dataset downloaded from the Internet.
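The translation step the abstract describes can be sketched in miniature: a multimodal topic model ties the video and text modalities together through a shared per-document topic mixture, so keywords for a query video are ranked by their marginal probability under that mixture. The sketch below is illustrative only, assuming a toy vocabulary and randomly drawn parameters in place of the paper's learned model; the variable names (`theta`, `phi`) are generic topic-model notation, not the paper's.

```python
import numpy as np

# Toy sketch of video-to-keyword translation in a multimodal topic model.
# theta : topic mixture inferred for one query video from its low-level
#         visual features (inference itself is elided here).
# phi   : per-topic word distributions learned from textual metadata.
# A keyword w is scored by p(w | video) = sum_k theta[k] * phi[k, w].
# All names and numbers are hypothetical, not taken from the paper.

rng = np.random.default_rng(0)

vocab = ["dog", "park", "running", "ball", "grass", "sky"]
n_topics = 3

phi = rng.dirichlet(np.ones(len(vocab)), size=n_topics)  # shape: topics x words
theta = rng.dirichlet(np.ones(n_topics))                 # topic mixture, one video

p_word = theta @ phi  # marginal keyword probabilities; sums to 1

# Bag-of-keywords summary: the three highest-probability words.
summary = sorted(zip(vocab, p_word), key=lambda t: -t[1])[:3]
print([w for w, _ in summary])
```

A simple template-based generator could then expand such a ranked keyword bag into a sentence, which is the flavor of natural language generation the abstract refers to.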