Estimating query result sizes for proxy caching in scientific database federations

Authors:
Tanu Malik;Randal Burns;Nitesh V. Chawla;Alex Szalay
Affiliations:
Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD;University of Notre Dame, Notre Dame, IN;Johns Hopkins University, Baltimore, MD
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Abstract

In a proxy cache for federations of scientific databases, it is important to estimate the size of a query's result before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. At the other extreme, inaccurate estimates can render the cache totally ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes that is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but by observing the workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.
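
The idea of learning result sizes from the workload alone can be illustrated with a minimal sketch. This is not the CAROT implementation: the abstract does not give its details, so the sketch substitutes a much simpler technique, a per-template least-squares fit on one numeric parameter, and omits the classification step entirely. The names `template_of` and `TemplateSizeEstimator`, the regex-based template extraction, and the choice of the first numeric literal as the only feature are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical helper: numeric literals that are not part of an identifier.
_NUM = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])")


def template_of(sql):
    """Return (template, params): the query with literals replaced by '?'."""
    params = [float(x) for x in _NUM.findall(sql)]
    return _NUM.sub("?", sql), params


class TemplateSizeEstimator:
    """Assumed simplification: one linear model per query template."""

    def __init__(self):
        # Per template: [n, sum_x, sum_y, sum_xx, sum_xy]
        self.stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])

    def observe(self, sql, result_size):
        """Record the result size seen for a query in the workload."""
        template, params = template_of(sql)
        x = params[0] if params else 0.0
        s = self.stats[template]
        s[0] += 1
        s[1] += x
        s[2] += result_size
        s[3] += x * x
        s[4] += x * result_size

    def estimate(self, sql):
        """Estimate a result size, or None if the template is unseen."""
        template, params = template_of(sql)
        if template not in self.stats:
            return None
        n, sx, sy, sxx, sxy = self.stats[template]
        x = params[0] if params else 0.0
        denom = n * sxx - sx * sx
        if denom == 0:
            # Too little variation to fit a line: fall back to the mean.
            return sy / n
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        return max(0.0, slope * x + intercept)


est = TemplateSizeEstimator()
est.observe("SELECT * FROM obj WHERE mag < 10", 100)
est.observe("SELECT * FROM obj WHERE mag < 20", 200)
est.observe("SELECT * FROM obj WHERE mag < 30", 300)
print(est.estimate("SELECT * FROM obj WHERE mag < 40"))
```

The estimator never touches the underlying data: it only sees queries and the sizes of the results they returned, which is the property the abstract emphasizes for resource-limited proxy caches.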