Bifocal sampling for skew-resistant join size estimation

Authors:
Sumit Ganguly; Phillip B. Gibbons; Yossi Matias; Avi Silberschatz
Affiliations:
Rutgers University; Bell Laboratories; Bell Laboratories; Bell Laboratories
Venue:
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Year:
1996

Abstract

This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√n lg n). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
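The sparse/dense split described in the abstract can be illustrated with a minimal sketch. This is not the authors' exact procedure: the abstract does not state the classification rule or the individual estimators, so the sample-count threshold `dense_min`, the function names, and the use of full frequency tables as a stand-in for an index lookup are all assumptions made for illustration. The sketch only shows the general idea of estimating the dense-dense part from the two samples and the remaining parts by looking up sampled tuples in the other relation.

```python
import random
from collections import Counter


def exact_join_size(r, s):
    """Exact equi-join size of two lists of join-attribute values."""
    fr, fs = Counter(r), Counter(s)
    return sum(c * fs[v] for v, c in fr.items())


def bifocal_estimate(r, s, m, dense_min=2, seed=0):
    """Illustrative sparse/dense join size estimate (hypothetical sketch).

    r, s      : lists of join-attribute values, one entry per tuple
    m         : sample size drawn from each relation (capped at its length)
    dense_min : assumed rule: a value is "dense" in a relation if it
                occurs at least this often in that relation's sample
    """
    if not r or not s:
        return 0.0
    rng = random.Random(seed)
    m_r, m_s = min(m, len(r)), min(m, len(s))
    samp_r, samp_s = rng.sample(r, m_r), rng.sample(s, m_s)
    cnt_r, cnt_s = Counter(samp_r), Counter(samp_s)
    scale_r, scale_s = len(r) / m_r, len(s) / m_s

    # Assumption: exact frequency tables stand in for an index lookup
    # ("how many tuples of the other relation carry this join value?").
    full_r, full_s = Counter(r), Counter(s)

    def dense_r(v):
        return cnt_r[v] >= dense_min

    def dense_s(v):
        return cnt_s[v] >= dense_min

    # Values dense in both relations: scale up the sample-to-sample join.
    dense_dense = sum(
        scale_r * c * scale_s * cnt_s[v]
        for v, c in cnt_r.items()
        if dense_r(v) and dense_s(v)
    )
    # Values sparse in s: look up each sampled r-tuple in s.
    any_sparse = scale_r * sum(full_s[v] for v in samp_r if not dense_s(v))
    # Values dense in s but sparse in r: look up each sampled s-tuple in r.
    sparse_dense = scale_s * sum(
        full_r[v] for v in samp_s if dense_s(v) and not dense_r(v)
    )
    return dense_dense + any_sparse + sparse_dense


# One heavily skewed value (1) plus many values that occur once.
r = [1] * 50 + list(range(2, 52))
s = [1] * 40 + list(range(30, 90))
estimate = bifocal_estimate(r, s, m=30)
```

When `m` equals the relation size, the three terms partition the join exactly and the sketch returns the true join size; for smaller samples it returns only an estimate, and none of the guarantees stated in the abstract should be assumed to carry over to this simplified version.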