Hashed samples: selectivity estimators for set similarity selection queries

  • Authors:
  • Marios Hadjieleftheriou; Xiaohui Yu; Nick Koudas; Divesh Srivastava

  • Affiliations:
  • AT&T Labs-Research, Florham Park, NJ; York University, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada; AT&T Labs-Research, Florham Park, NJ

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2008

Abstract

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed: uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (in both CPU and I/O cost). We then show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
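The abstract does not spell out the construction, so the following is only an illustrative sketch of the general idea of hash-based sampling for selectivity estimation, not the paper's algorithm. It assumes a toy in-memory collection, an IDF-style weighted cosine similarity, and a sample defined by keeping every set whose hashed id falls below a sampling rate. Because the same hash decides membership in every inverted list, a sampled set appears in all of its lists, so its full overlap with a query is visible in the sample. All function names (`h`, `idf_weights`, `build_hashed_index`, `estimate_selectivity`) and the scaling by `1 / rate` are assumptions made for this example.

```python
import hashlib
import math
from collections import defaultdict


def h(set_id):
    """Deterministically map a set id to a value in [0, 1)."""
    digest = hashlib.md5(str(set_id).encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def idf_weights(sets):
    """IDF-style token weights (an assumed stand-in for TF/IDF or BM25)."""
    n = len(sets)
    df = defaultdict(int)
    for tokens in sets.values():
        for t in tokens:
            df[t] += 1
    return {t: math.log(1 + n / c) for t, c in df.items()}


def cosine(q, s, w):
    """Weighted cosine similarity between two token sets."""
    dot = sum(w.get(t, 0.0) ** 2 for t in q & s)
    nq = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in q))
    ns = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in s))
    return dot / (nq * ns) if nq and ns else 0.0


def build_hashed_index(sets, rate):
    """Inverted lists restricted to set ids whose hash is below `rate`."""
    index = defaultdict(list)
    for sid, tokens in sets.items():
        if h(sid) < rate:  # same decision for every list this set appears in
            for t in tokens:
                index[t].append(sid)
    return index


def estimate_selectivity(query, tau, sets, index, w, rate):
    """Count sampled sets with similarity >= tau, scaled up by 1 / rate."""
    candidates = set()
    for t in query:
        candidates.update(index.get(t, ()))
    hits = sum(1 for sid in candidates if cosine(query, sets[sid], w) >= tau)
    return hits / rate


def exact_selectivity(query, tau, sets, w):
    """Baseline: scan every set in the collection."""
    return sum(1 for s in sets.values() if cosine(query, s, w) >= tau)
```

A typical use would build the index once with a small `rate` (for example 0.05) and then call `estimate_selectivity` per query, comparing against `exact_selectivity` on a held-out workload to judge accuracy. With `rate = 1.0` the sample contains every set, so the estimate coincides with the exact count.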