ATLAS: a probabilistic algorithm for high dimensional similarity search

  • Authors:
  • Jiaqi Zhai;Yin Lou;Johannes Gehrke

  • Affiliations:
  • Cornell University, Ithaca, NY, USA;Cornell University, Ithaca, NY, USA;Cornell University, Ithaca, NY, USA

  • Venue:
  • Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given a set of high dimensional binary vectors and a similarity function (such as Jaccard and Cosine), we study the problem of finding all pairs of vectors whose similarity exceeds a given threshold. The solution to this problem is a key component in many applications with feature-rich objects, such as text, images, music, videos, or social networks. In particular, there are many important emerging applications that require the use of relatively low similarity thresholds. We propose ATLAS, a probabilistic similarity search algorithm that in expectation finds a 1 - δ fraction of all similar vector pairs. ATLAS uses truly random permutations both to filter candidate pairs of vectors and to estimate the similarity between vectors. At a 97.5% recall rate, ATLAS consistently outperforms all state-of-the-art approaches and achieves a speed-up of up to two orders of magnitude over both exact and approximate algorithms.