Indexing expensive functions for efficient multi-dimensional similarity search

  • Authors:
  • Hanxiong Chen;Jianquan Liu;Kazutaka Furuse;Jeffrey Xu Yu;Nobuo Ohbo

  • Affiliations:
  • University of Tsukuba, Department of Computer Science, Graduate School of Systems and Information Engineering, 1-1-1 Tennodai, 305-8577, Tsukuba-shi, Ibaraki-ken, Japan;University of Tsukuba, Department of Computer Science, Graduate School of Systems and Information Engineering, 1-1-1 Tennodai, 305-8577, Tsukuba-shi, Ibaraki-ken, Japan;University of Tsukuba, Department of Computer Science, Graduate School of Systems and Information Engineering, 1-1-1 Tennodai, 305-8577, Tsukuba-shi, Ibaraki-ken, Japan;Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management, Sha Tin, Hong Kong, China;University of Tsukuba, Department of Computer Science, Graduate School of Systems and Information Engineering, 1-1-1 Tennodai, 305-8577, Tsukuba-shi, Ibaraki-ken, Japan

  • Venue:
  • Knowledge and Information Systems - Special Issue: Best Papers of the Fifth International Conference on Advanced Data Mining and Applications (ADMA 2009)
  • Year:
  • 2011

Abstract

Similarity search is important in information retrieval applications, where objects are usually represented as high-dimensional vectors. This creates a growing need for indexing support for high-dimensional data. However, index structures based on space partitioning become ineffective because of the well-known “curse of dimensionality”, and a linear scan of the data with approximation is more efficient for high-dimensional similarity search. Approaches so far have concentrated on reducing I/O and have ignored the computation cost. For an expensive distance function, such as the L_p norm with fractional p, the computation cost becomes the bottleneck. We propose a new technique that addresses expensive distance functions by “indexing the function”: some key values of the function are pre-computed once, and these values are then used to derive upper and lower bounds on the distance between a data vector and the query vector. The technique is extremely efficient because it avoids most of the distance function computations; moreover, it requires no extra secondary storage because no index is constructed and stored. The efficiency is confirmed by cost analysis, as well as by experiments on synthetic and real data.
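
The idea of "indexing the function" can be illustrated with a minimal sketch. The abstract does not give the paper's actual bound construction, so everything below is an assumption made for illustration: coordinates are taken to lie in [0, 1], the key values are t^p sampled on a uniform grid, and the names `build_table`, `bounds`, `exact_sum` and `nearest` are hypothetical. Because t^p is increasing in t, each per-coordinate difference is bracketed by the two grid values around it, which gives lower and upper bounds on the sum of p-th powers using table lookups only. Since the distance is a monotone function of that sum, comparing sums is enough for ranking.

```python
import math

def build_table(p, cells):
    """Pre-compute t**p once at the grid points t = k / cells, k = 0..cells."""
    return [(k / cells) ** p for k in range(cells + 1)]

def bounds(x, q, table):
    """Lower/upper bounds on sum(|x_i - q_i|**p) using table lookups only.

    Assumes every coordinate lies in [0, 1], so each difference d is in [0, 1].
    The bounds hold up to floating-point rounding in the cell lookup.
    """
    cells = len(table) - 1
    lo = hi = 0.0
    for a, b in zip(x, q):
        d = abs(a - b)
        k = min(int(d * cells), cells - 1)  # grid cell that contains d
        lo += table[k]       # t**p is increasing, so table[k] <= d**p
        hi += table[k + 1]   # and d**p <= table[k + 1]
    return lo, hi

def exact_sum(x, q, p):
    """The expensive computation: one fractional power per dimension."""
    return sum(abs(a - b) ** p for a, b in zip(x, q))

def nearest(data, q, p, table):
    """Linear scan that skips the exact computation whenever the lower
    bound shows the vector cannot beat the best candidate found so far."""
    best_i, best = -1, math.inf
    exact_calls = 0
    for i, x in enumerate(data):
        lo, _ = bounds(x, q, table)
        if lo >= best:
            continue  # pruned without evaluating any fractional power
        s = exact_sum(x, q, p)
        exact_calls += 1
        if s < best:
            best_i, best = i, s
    return best_i, exact_calls
```

In this sketch the table holds only `cells + 1` numbers and depends on p alone, not on the data, which is in the spirit of the abstract's remark that no index over the data is constructed or stored. A finer grid tightens the bounds and prunes more vectors at the cost of a larger table.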