Efficient Similarity Search by Reducing I/O with Compressed Sketches

  • Authors:
  • Arnoldo José Muller-Molina; Takeshi Shinohara

  • Venue:
  • SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
  • Year:
  • 2009

Abstract

Sketches are compact bit-string representations of objects. Objects that share a sketch are stored in the same database bucket. The Hamming distance between two sketches gives an estimate of the similarity of their respective objects: objects that are close to each other are expected to have sketches with a small Hamming distance. This estimate is used to schedule the order in which buckets are visited at search time. Recent research has shown that sketches can effectively approximate $L_1$ and $L_2$ distances in high-dimensional settings. A remaining task is to provide a general sketch for arbitrary metric spaces. This paper presents a novel sketch, based on generalized hyperplane partitioning, that can be employed on arbitrary metric spaces. The core of the sketch is a heuristic that tries to generate balanced partitions. The indexing method AESA stores all the distances among database objects, which allows it to answer queries with a small number of distance computations. Experimental evaluations show that, given a good early termination strategy, our algorithm performs up to one order of magnitude fewer distance operations than AESA in string spaces. Comparisons against other methods show greater gains. Furthermore, we experimentally demonstrate that the physical size of the sketches can be reduced by a factor of ten with different run-length encodings.