Proceedings of the 21st ACM international conference on Multimedia
Documents containing both video and text are increasingly widespread, yet content analysis of such documents still depends primarily on the text. Automated discovery of semantically related words from text improves free-text query understanding; translating videos into text summaries, however, enables better video search, particularly when accompanying text is absent. In this paper, we propose a multimedia topic modeling framework that provides a basis for automatically discovering and translating semantically related words, obtained from the textual metadata of multimedia documents, into semantically related videos or video frames. The framework jointly models video and text and is flexible enough to handle different feature types in each constituent domain: discrete and real-valued features from videos representing actions, objects, colors, and scenes, as well as discrete features from text. Our proposed models fit the multimedia data much better in terms of held-out log likelihood. For a given query video, the models translate low-level vision features into bag-of-keywords summaries, which simple natural language generation techniques can further turn into human-readable paragraphs. We quantitatively compare the resulting video-to-bag-of-words translations against a state-of-the-art baseline object recognition model from computer vision, and we show that text translations from our multimodal topic models vastly outperform the baseline on a multimedia dataset downloaded from the Internet.
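The translation step the abstract describes can be sketched in miniature: a multimodal topic model ties the video and text modalities together through a shared per-document topic mixture, so keywords for a query video are ranked by their marginal probability under that mixture. The sketch below is illustrative only, assuming a toy vocabulary and randomly drawn parameters in place of the paper's learned model; the variable names (`theta`, `phi`) are generic topic-model notation, not the paper's.

```python
import numpy as np

# Toy sketch of video-to-keyword translation in a multimodal topic model.
# theta : topic mixture inferred for one query video from its low-level
#         visual features (inference itself is elided here).
# phi   : per-topic word distributions learned from textual metadata.
# A keyword w is scored by p(w | video) = sum_k theta[k] * phi[k, w].
# All names and numbers are hypothetical, not taken from the paper.

rng = np.random.default_rng(0)

vocab = ["dog", "park", "running", "ball", "grass", "sky"]
n_topics = 3

phi = rng.dirichlet(np.ones(len(vocab)), size=n_topics)  # shape: topics x words
theta = rng.dirichlet(np.ones(n_topics))                 # topic mixture, one video

p_word = theta @ phi  # marginal keyword probabilities; sums to 1

# Bag-of-keywords summary: the three highest-probability words.
summary = sorted(zip(vocab, p_word), key=lambda t: -t[1])[:3]
print([w for w, _ in summary])
```

A simple template-based generator could then expand such a ranked keyword bag into a sentence, which is the flavor of natural language generation the abstract refers to.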