On random sampling over joins

Authors:
Surajit Chaudhuri;Rajeev Motwani;Vivek Narasayya
Affiliations:
Microsoft Research;Stanford University;Microsoft Research
Venue:
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Year:
1999

Abstract

A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree completely. We undertake a detailed study of this problem and attempt to analyze it in a variety of settings. We present theoretical results explaining the difficulty of this problem and setting limits on the efficiency that can be achieved. Based on new insights into the interaction between join and sampling, we develop join sampling techniques for the settings where our negative results do not apply. Our new sampling algorithms are significantly more efficient than those known earlier. We present an experimental evaluation of our techniques on Microsoft's SQL Server 7.0.
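To make the problem concrete, the following is a minimal sketch of one well-known way to sample a two-way equi-join without materializing it. It is not claimed to be any of the algorithms the abstract refers to. It assumes that one relation (`r2` here) can be fully indexed by join value, which is exactly the kind of setting-dependent assumption the abstract alludes to. The function name `sample_join` and the parameters `key1`, `key2`, and `rng` are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def sample_join(r1, r2, key1, key2, n, rng=None):
    """Draw n pairs, with replacement, uniformly from r1 JOIN r2.

    Assumption (not from the source): r2 fits in an in-memory index
    keyed by join value, so match counts are known up front.
    """
    rng = rng or random.Random()

    # Group r2 tuples by their join value.
    index = defaultdict(list)
    for t2 in r2:
        index[key2(t2)].append(t2)

    # Weight each r1 tuple by how many r2 tuples it joins with.
    # Sampling r1 uniformly instead would over-represent join values
    # that are rare in r2.
    weights = [len(index.get(key1(t1), ())) for t1 in r1]
    if sum(weights) == 0:
        return []  # the join is empty, so there is nothing to sample

    # Pick r1 tuples proportionally to their match count, then pick one
    # matching r2 tuple uniformly; each join pair is equally likely.
    picks = rng.choices(r1, weights=weights, k=n)
    return [(t1, rng.choice(index[key1(t1)])) for t1 in picks]


if __name__ == "__main__":
    r1 = [(1, "a"), (2, "b"), (3, "c")]
    r2 = [(1, "x"), (1, "y"), (2, "z")]
    first = lambda t: t[0]
    for pair in sample_join(r1, r2, first, first, n=5, rng=random.Random(0)):
        print(pair)
```

The weighting step is the point of the sketch: a tuple of `r1` that matches two tuples of `r2` contributes two rows to the join, so it has to be drawn twice as often as a tuple with a single match. When such frequency information is not available for either input, this approach does not apply.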