Multimodal fusion using learned text concepts for image categorization

  • Authors: Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng
  • Affiliation: University of California, Santa Barbara, CA (all authors)

  • Venue: MULTIMEDIA '06: Proceedings of the 14th Annual ACM International Conference on Multimedia
  • Year: 2006

Abstract

Conventional image categorization techniques rely primarily on low-level visual cues. In this paper, we describe a multimodal fusion scheme that improves image classification accuracy by incorporating information derived from texts embedded in the image under classification. For each image category, a text concept is first learned from a set of labeled texts found in images of that category using Multiple Instance Learning [1]. For an image under classification that contains multiple detected text lines, we compute a weighted Euclidean distance between each text line and the learned text concept of the target category. The minimum of these distances is then used jointly with low-level visual cues as the feature vector for SVM-based classification. Experiments on a challenging image database demonstrate that the proposed fusion framework achieves higher accuracy than state-of-the-art methods for image classification.
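The fusion step described above can be illustrated with a minimal sketch, assuming a text concept (a point `concept` and per-dimension weights `weights`) has already been learned with Multiple Instance Learning. All names, the feature dimensions, and the no-text fallback value below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical placeholder distance used when an image contains no detected text.
NO_TEXT_DISTANCE = 1e6

def min_weighted_distance(text_lines, concept, weights):
    """Minimum weighted Euclidean distance between any detected text-line
    feature vector and the learned text concept of the target category."""
    if len(text_lines) == 0:
        return NO_TEXT_DISTANCE
    diffs = np.asarray(text_lines) - concept          # (n_lines, d_t)
    dists = np.sqrt(((weights * diffs) ** 2).sum(axis=1))
    return dists.min()

def fuse_features(visual_feat, text_lines, concept, weights):
    """Concatenate low-level visual cues with the text-concept distance."""
    d_min = min_weighted_distance(text_lines, concept, weights)
    return np.concatenate([visual_feat, [d_min]])

# Hypothetical usage: X_visual[i] holds the visual cues of image i and
# texts[i] is the list of text-line feature vectors detected in image i.
# X = np.stack([fuse_features(X_visual[i], texts[i], concept, weights)
#               for i in range(len(X_visual))])
# clf = SVC(kernel="rbf").fit(X, y)
```

The sketch only covers the distance computation and feature concatenation; learning the concept point and its weights (e.g., via a Diverse Density style MIL objective) is assumed to have happened beforehand.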