Scaling up all pairs similarity search

  • Authors:
  • Roberto J. Bayardo; Yiming Ma; Ramakrishnan Srikant

  • Affiliations:
  • Google, Inc., Mountain View, CA; University of California, Irvine, Irvine, CA; Google, Inc., Mountain View, CA

  • Venue:
  • Proceedings of the 16th international conference on World Wide Web
  • Year:
  • 2007

Abstract

Given a large collection of sparse vector data in a high-dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine similarity) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show that the approach efficiently handles a variety of datasets across a wide range of similarity thresholds, with large speedups over previous state-of-the-art approaches.
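
To make the problem statement concrete, the following is a minimal sketch of an exact all-pairs search over sparse vectors using a plain inverted index. It is not the optimized algorithm proposed in the paper and includes none of its indexing or pruning strategies. The function names `normalize` and `all_pairs`, the dict-based vector representation, and the choice to report pairs scoring at or above the threshold are assumptions made for illustration. The sketch also assumes a positive threshold, since pairs that share no feature are never scored.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def all_pairs(vectors, threshold):
    """Return sorted (i, j, score) triples with i < j and cosine >= threshold.

    Illustrative baseline only: every vector is fully indexed, so no
    candidate pruning takes place.
    """
    index = defaultdict(list)  # feature -> list of (vector id, weight)
    results = []
    for j, raw in enumerate(vectors):
        x = normalize(raw)
        # Accumulate dot products against previously indexed vectors
        # that share at least one feature with x.
        scores = defaultdict(float)
        for f, w in x.items():
            for i, wi in index[f]:
                scores[i] += w * wi
        results.extend((i, j, s) for i, s in scores.items() if s >= threshold)
        # Index x only after scoring, so each pair is visited once.
        for f, w in x.items():
            index[f].append((j, w))
    return sorted(results)


if __name__ == "__main__":
    docs = [
        {"a": 1.0, "b": 1.0},
        {"a": 2.0, "b": 2.0},
        {"c": 1.0},
    ]
    for i, j, score in all_pairs(docs, 0.9):
        print(i, j, round(score, 3))
```

Because unit-length vectors reduce cosine similarity to a dot product, the inverted index only has to touch vectors that share a nonzero feature with the query vector, which is what makes sparsity useful here.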