Hashed samples: selectivity estimators for set similarity selection queries

Authors:
Marios Hadjieleftheriou;Xiaohui Yu;Nick Koudas;Divesh Srivastava
Affiliations:
AT&T Labs-Research, Florham Park NJ;York University, Toronto ON, Canada;University of Toronto, Toronto ON, Canada;AT&T Labs-Research, Florham Park NJ
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed: uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (in both CPU and I/O cost). We then show how to build robust samples a priori, based on existing synopses for distinct-value estimation. We prove the accuracy of our technique theoretically and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
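To make the problem setting concrete, the following is a minimal sketch of the naive baseline the abstract cautions against: estimating the selectivity of a weighted set similarity selection query from a uniform random sample. It is not the hashed-sample technique of the paper, whose details are not given here. The IDF formula, the function names, and the toy data are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def idf_weights(sets):
    """Assumed IDF variant: weight(token) = log(1 + N / document_frequency)."""
    n = len(sets)
    df = Counter(tok for s in sets for tok in s)
    return {tok: math.log(1.0 + n / count) for tok, count in df.items()}


def weighted_cosine(a, b, w):
    """Cosine similarity of two token sets under per-token weights w."""
    dot = sum(w.get(t, 0.0) ** 2 for t in a & b)
    norm_a = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in a))
    norm_b = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_selectivity(sets, query, threshold, w):
    """Number of sets whose similarity to the query meets the threshold."""
    return sum(1 for s in sets if weighted_cosine(s, query, w) >= threshold)


def sampled_selectivity(sets, query, threshold, w, sample_size, seed=0):
    """Naive estimate: count qualifying sets in a uniform sample, then scale up."""
    rng = random.Random(seed)
    sample = rng.sample(sets, sample_size)
    hits = sum(1 for s in sample if weighted_cosine(s, query, w) >= threshold)
    return hits * len(sets) / sample_size


# Hypothetical toy collection of token sets.
data = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"d"}),
    frozenset({"a", "b", "c"}),
]
weights = idf_weights(data)
query = frozenset({"a", "b"})
print(exact_selectivity(data, query, 0.99, weights))
print(sampled_selectivity(data, query, 0.99, weights, sample_size=2))
```

With a selective threshold, few sets qualify, so a small uniform sample often contains none of them and the scaled estimate collapses to zero. This is one way to see why the abstract reports very low accuracy for uniform sampling.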