Corpus-guided sentence generation of natural images
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised metric fusion by cross diffusion
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Midge: generating image descriptions from computer vision detections
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Collective generation of natural image descriptions
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Hi-index | 0.00 |
With the overwhelming amounts of visual contents on the Internet nowadays, it is very important to generate meaningful and succinct descriptions of multimedia contents including images and videos. Although human taggings and annotations can partially label some of the images or videos, it is impossible to exhaustively describe all the multimedia data due to its huge scale. Therefore, the key to this important task is to develop an effective algorithm that can automatically generate a description of an image or a frame. In this paper, we propose a multimodal feature fusion framework which can model any given image-description pair using semantically meaningful features. This framework is trained as a combination of multi-modal deep networks having two integral components: An ensemble of image descriptors and a recursive bigram encoder with fixed length output feature vector. These two components are then integrated into a joint model characterizing the correlations between images and texts. The proposed framework can not only model the unique characteristics of images or texts, but also take into account their correlations at the semantic level. Experiments on real image-text data sets show that the proposed framework is effective and efficient in indexing and retrieving semantically similar pairs, which will be very useful to help people locate interesting images or videos in large-scale databases.