Query Size Estimation for Joins Using Systematic Sampling

Authors:
A. H. H. Ngu;B. Harangsri;J. Shepherd
Affiliations:
Department of Computer Science, Texas State University, San Marcos, Texas, USA. angu@swt.edu;National Electronics and Computer Technology, 112 Thailand Science Park, Paholyothin Rd., Pathumthani 12120, Thailand. banchong@notes.nectec.or.th;School of Computer Science and Engineering, University of New South Wales, 2052, Sydney, Australia. jas@cse.unsw.edu.au
Venue:
Distributed and Parallel Databases
Year:
2004

Abstract

We propose a new approach to estimating query result sizes for join queries. The technique, which we call systematic sampling (SYSSMP), is a novel variant of the sampling-based approach. Its key novelty is that it exploits the sortedness of the data; as a result, the sample relation closely represents the underlying frequency distribution of the join attribute in the original relation.

We first develop a theoretical foundation for systematic sampling, which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the sample relations yielded by systematic sampling are of higher quality than those produced by simple random sampling.

To check that sample relations produced by systematic sampling do in fact yield more accurate query result size estimates, we compare systematic sampling with the most efficient simple random sampling method, t_cross, using a variety of relation configurations. The results show that systematic sampling, using the same amount of sampling, provides more accurate query result sizes than t_cross. Furthermore, the extra sampling cost incurred by systematic sampling pays off in a cheaper query execution cost at run-time.
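
The contrast between the two sampling schemes can be illustrated with a minimal Python sketch. It is not the SYSSMP or t_cross procedure from the paper, whose details the abstract does not give. The function names, the fixed sampling step `k`, and the plain scale-up join estimator are all assumptions made for illustration. Systematic sampling here means taking every k-th tuple of a relation sorted on the join attribute, starting from a random offset.

```python
import random
from collections import Counter


def systematic_sample(sorted_values, k, rng):
    """Take every k-th value of a list sorted on the join attribute.

    The starting offset is drawn uniformly from [0, k). Hypothetical helper,
    not the paper's exact procedure.
    """
    start = rng.randrange(k)
    return sorted_values[start::k]


def simple_random_sample(values, n, rng):
    """Draw n values without replacement, ignoring any ordering."""
    return rng.sample(values, n)


def join_size(r, s):
    """Exact equi-join size: sum over join values of freq_r(v) * freq_s(v)."""
    freq_r, freq_s = Counter(r), Counter(s)
    return sum(count * freq_s[value] for value, count in freq_r.items())


def estimate_join_size(sample_r, sample_s, k_r, k_s):
    """Scale the join size of two samples by the inverse sampling fractions.

    Assumed estimator for illustration only: each relation is sampled at a
    rate of 1/k, so the sample join is scaled up by k_r * k_s.
    """
    return join_size(sample_r, sample_s) * k_r * k_s


# Synthetic relation: 100 distinct join values, 10 tuples each, already sorted.
relation = [value for value in range(100) for _ in range(10)]
step = 10
rng = random.Random(0)

sys_r = systematic_sample(relation, step, rng)
sys_s = systematic_sample(relation, step, rng)
srs_r = simple_random_sample(relation, len(relation) // step, rng)
srs_s = simple_random_sample(relation, len(relation) // step, rng)

print("exact self-join size:", join_size(relation, relation))
print("systematic estimate: ", estimate_join_size(sys_r, sys_s, step, step))
print("random estimate:     ", estimate_join_size(srs_r, srs_s, step, step))
```

On this uniform toy relation, every systematic sample contains each join value exactly once whatever the starting offset, so the sample reproduces the frequency distribution exactly. A simple random sample of the same size will usually miss some values and repeat others. Real relations are rarely this regular, so the sketch only shows the intuition behind exploiting sortedness, not the size of the accuracy gain reported in the paper.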