A multimodal framework for unsupervised feature fusion

  • Authors:
  • Xiaoyi Li; Jing Gao; Hui Li; Le Yang; Rohini K. Srihari

  • Affiliations:
  • SUNY at Buffalo, Buffalo, New York, USA (all authors)

  • Venue:
  • Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13)
  • Year:
  • 2013

Abstract

With the overwhelming amount of visual content on the Internet, it is important to generate meaningful and succinct descriptions of multimedia content, including images and videos. Although human tags and annotations can label some images and videos, it is impossible to describe all multimedia data exhaustively because of its sheer scale. The key to this task is therefore an effective algorithm that can automatically generate a description of an image or a video frame. In this paper, we propose a multimodal feature fusion framework that can model any given image-description pair using semantically meaningful features. The framework is trained as a combination of multimodal deep networks with two integral components: an ensemble of image descriptors and a recursive bigram encoder that produces a fixed-length output feature vector. These two components are then integrated into a joint model that characterizes the correlations between images and texts. The proposed framework not only models the unique characteristics of images and texts, but also captures their correlations at the semantic level. Experiments on real image-text data sets show that the proposed framework is effective and efficient at indexing and retrieving semantically similar pairs, which helps users locate relevant images or videos in large-scale databases.
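
The sketch below illustrates, in PyTorch, the general shape of the architecture the abstract describes: an image branch over precomputed descriptors, a recursive bigram encoder that folds word embeddings into a fixed-length vector, and a joint layer relating the two modalities. All class names, dimensions, and the cosine-similarity scoring (ImageBranch, RecursiveBigramEncoder, JointFusion, embed_dim=128, etc.) are illustrative assumptions, not the authors' actual model or code.

```python
# Minimal illustration of multimodal image-text fusion; names and
# hyperparameters are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ImageBranch(nn.Module):
    """Maps a concatenation of precomputed image descriptors to a shared space."""

    def __init__(self, descriptor_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_dim, embed_dim),
            nn.Tanh(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, descriptors: torch.Tensor) -> torch.Tensor:
        return self.net(descriptors)


class RecursiveBigramEncoder(nn.Module):
    """Recursively merges adjacent word embeddings (bigrams) until a single
    fixed-length sentence vector remains, in the spirit of a recursive encoder."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.compose = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (seq_len, embed_dim)
        vectors = list(word_embeddings)
        while len(vectors) > 1:
            merged = []
            for i in range(0, len(vectors) - 1, 2):
                pair = torch.cat([vectors[i], vectors[i + 1]], dim=-1)
                merged.append(torch.tanh(self.compose(pair)))
            if len(vectors) % 2 == 1:   # carry an unpaired trailing word forward
                merged.append(vectors[-1])
            vectors = merged
        return vectors[0]               # fixed-length sentence representation


class JointFusion(nn.Module):
    """Scores how well an image-description pair matches in the shared space."""

    def __init__(self, descriptor_dim: int, embed_dim: int = 128):
        super().__init__()
        self.image_branch = ImageBranch(descriptor_dim, embed_dim)
        self.text_branch = RecursiveBigramEncoder(embed_dim)

    def forward(self, descriptors: torch.Tensor,
                word_embeddings: torch.Tensor) -> torch.Tensor:
        img = self.image_branch(descriptors)
        txt = self.text_branch(word_embeddings)
        # Cosine similarity is a simple stand-in for the paper's joint
        # correlation model between the two modalities.
        return torch.cosine_similarity(img, txt, dim=-1)


if __name__ == "__main__":
    model = JointFusion(descriptor_dim=512, embed_dim=128)
    image_descriptors = torch.randn(512)      # e.g. concatenated visual descriptors
    caption_embeddings = torch.randn(7, 128)  # 7 words, 128-d embeddings each
    print(model(image_descriptors, caption_embeddings).item())
```

In a retrieval setting, such a similarity score would be computed between a query in one modality and candidates in the other, so that semantically similar image-text pairs rank highly; the actual training objective and encoders used by the authors may differ.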