Efficient multi-modal retrieval in conceptual space

  • Authors:
  • Jun Imura;Teppei Fujisawa;Tatsuya Harada;Yasuo Kuniyoshi

  • Affiliations:
  • The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo / JST PRESTO, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan

  • Venue:
  • MM '11 Proceedings of the 19th ACM international conference on Multimedia
  • Year:
  • 2011

Abstract

In this paper, we propose a new, efficient retrieval system for large-scale multi-modal data including video tracks. With large-scale multi-modal data, the huge data size and varied contents degrade both the efficiency and the precision of retrieval results. Recent research on image annotation and retrieval shows that image features based on the Bag-of-Visual-Words approach with local descriptors such as SIFT perform surprisingly well on large-scale image datasets. However, these powerful descriptors tend to be high-dimensional, imposing a high computational cost for approximate nearest neighbor search in the raw feature space. Our video retrieval method focuses on the correlation between image, sound, and location information recorded simultaneously, and learns a conceptual space describing the contents of the data to realize efficient search. Experiments show that our retrieval system achieves good performance with low memory usage and low time complexity.
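
The abstract describes retrieval by mapping high-dimensional multi-modal features into a compact conceptual space and searching there. The sketch below illustrates that general idea, not the authors' actual method: it pairs synthetic Bag-of-Visual-Words image histograms with audio features, learns a shared low-dimensional space via CCA (an assumption; the paper's learning method is not specified in this abstract), and performs nearest-neighbor retrieval on the compact vectors. All dimensions and feature values are placeholders.

```python
# Minimal illustrative sketch of retrieval in a learned low-dimensional
# "conceptual" space; CCA and all dimensions are assumptions, not the paper's method.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_clips = 500
img_feats = rng.random((n_clips, 500))    # placeholder 500-d BoVW histograms (e.g. SIFT codewords)
audio_feats = rng.random((n_clips, 64))   # placeholder 64-d audio descriptors

# Learn a shared 16-d space from paired image/audio features of the same clips.
cca = CCA(n_components=16, max_iter=1000)
cca.fit(img_feats, audio_feats)
img_concepts, _ = cca.transform(img_feats, audio_feats)

# Index the compact vectors instead of the raw high-dimensional features,
# reducing memory usage and query cost.
index = NearestNeighbors(n_neighbors=5).fit(img_concepts)

# Map a query clip's image feature into the same space and retrieve neighbors.
query = rng.random((1, 500))
query_concept = cca.transform(query)
dists, ids = index.kneighbors(query_concept)
print(ids)
```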