A multimodal framework for unsupervised feature fusion

Authors:
Xiaoyi Li;Jing Gao;Hui Li;Le Yang;Rohini K. Srihari
Affiliations:
SUNY at Buffalo, Buffalo, New York, USA;SUNY at Buffalo, Buffalo, New York, USA;SUNY at Buffalo, Buffalo, New York, USA;SUNY at Buffalo, Buffalo, New York, USA;SUNY at Buffalo, Buffalo, New York, USA
Venue:
Proceedings of the 22nd ACM international conference on information & knowledge management
Year:
2013

Abstract

With the overwhelming amount of visual content now on the Internet, generating meaningful and succinct descriptions of multimedia content, including images and videos, has become very important. Although human tagging and annotation can label some images or videos, exhaustively describing all multimedia data by hand is impossible at this scale. The key to this task is therefore an effective algorithm that automatically generates a description of an image or a video frame. In this paper, we propose a multimodal feature fusion framework that can model any given image-description pair using semantically meaningful features. The framework is trained as a combination of multimodal deep networks with two integral components: an ensemble of image descriptors and a recursive bigram encoder with a fixed-length output feature vector. These two components are then integrated into a joint model that characterizes the correlations between images and texts. The proposed framework not only models the unique characteristics of images and texts, but also takes into account their correlations at the semantic level. Experiments on real image-text data sets show that the proposed framework is effective and efficient in indexing and retrieving semantically similar pairs, which should help people locate interesting images or videos in large-scale databases.
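
The abstract names a recursive bigram encoder with a fixed-length output and a retrieval step over semantically similar pairs, but it gives no equations. The sketch below is an illustration only, not the authors' model. It assumes a single tanh composition layer, a simple left-to-right fold over word vectors, randomly initialized (untrained) weights, and cosine similarity for ranking. All function names (make_params, compose, encode, cosine, rank) are hypothetical.

```python
import math
import random


def make_params(dim, seed=0):
    """Random weights for a hypothetical pair-composition layer (dim x 2*dim)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def compose(left, right, weights, bias):
    """Merge two dim-length vectors into one: tanh(W [left; right] + b)."""
    pair = left + right  # list concatenation, length 2*dim
    return [
        math.tanh(sum(w * x for w, x in zip(row, pair)) + b)
        for row, b in zip(weights, bias)
    ]


def encode(word_vectors, weights, bias):
    """Fold a sentence left to right; output length is dim for any sentence length."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    state = word_vectors[0]
    for vec in word_vectors[1:]:
        state = compose(state, vec, weights, bias)
    return state


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, candidates):
    """Candidate indices sorted by descending cosine similarity to the query."""
    return sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
```

Under these assumptions, a two-word caption and a seven-word caption both encode to a vector of length dim, which is what allows text vectors to be compared against image feature vectors of a fixed size in a shared retrieval step.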