Estimating query result sizes for proxy caching in scientific database federations

  • Authors:
  • Tanu Malik;Randal Burns;Nitesh V. Chawla;Alex Szalay

  • Affiliations:
  • Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD;University of Notre Dame, Notre Dame, IN;Johns Hopkins University, Baltimore, MD

  • Venue:
  • Proceedings of the 2006 ACM/IEEE conference on Supercomputing
  • Year:
  • 2006

Abstract

In a proxy cache for federations of scientific databases, it is important to estimate the size of a query's result before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. At the other extreme, inaccurate estimates can render the cache totally ineffective.

We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but by observing the workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.
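To make the idea of learning result sizes from the workload more concrete, the following is a minimal sketch and not the CAROT algorithm itself. It assumes that a query "template" is the query text with its literals replaced by placeholders, and it substitutes a simple nearest-neighbour lookup over previously observed parameters for the classification and regression models the abstract refers to. All names here (`split_template`, `TemplateSizeEstimator`, the `objects` table) are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Matches single-quoted strings and bare numeric literals.
# Illustrative only: this is not a full SQL lexer.
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")


def split_template(query):
    """Return (template, numeric_params), with every literal replaced by '?'."""
    params = []

    def replace(match):
        text = match.group(0)
        if not text.startswith("'"):
            params.append(float(text))  # keep numeric literals as features
        return "?"

    template = _LITERAL.sub(replace, query)
    # Normalise case and whitespace so equivalent queries share a template.
    return " ".join(template.lower().split()), tuple(params)


class TemplateSizeEstimator:
    """Estimates result sizes purely from observed (query, size) pairs."""

    def __init__(self, default_size=0.0):
        self.history = defaultdict(list)  # template -> [(params, size), ...]
        self.default_size = default_size  # assumed fallback for unseen templates

    def observe(self, query, result_size):
        """Record the actual result size of a query that has been executed."""
        template, params = split_template(query)
        self.history[template].append((params, float(result_size)))

    def estimate(self, query):
        """Return the size seen for the closest past query with the same template."""
        template, params = split_template(query)
        seen = self.history.get(template)
        if not seen:
            return self.default_size

        def distance(entry):
            past_params = entry[0]
            return sum((a - b) ** 2 for a, b in zip(past_params, params))

        return min(seen, key=distance)[1]


estimator = TemplateSizeEstimator()
estimator.observe("SELECT * FROM objects WHERE ra < 10", 1000)
estimator.observe("SELECT * FROM objects WHERE ra < 100", 10000)
print(estimator.estimate("SELECT * FROM objects WHERE ra < 12"))
```

The estimator never touches the underlying data: it only sees queries and the sizes of the results that came back, which is the property the abstract highlights as suiting a resource-limited proxy. A real system would need a proper SQL parser and a model that generalises beyond the nearest past observation.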