Adaptive-sampling algorithms for answering aggregation queries on Web sites

Authors:
Foto N. Afrati;Paraskevas V. Lekeas;Chen Li
Affiliations:
Computer Science Division, NTUA, Athens, Greece;University of Crete, Department of Applied Mathematics, Leoforos Knossou, Hrakleio, 714 09 Crete, Greece;Department of Computer Science, UC Irvine, CA 92697, USA
Venue:
Data & Knowledge Engineering
Year:
2008

Abstract

Many Web sites publish their data in a hierarchical structure. For instance, Amazon.com organizes its pages on books as a hierarchy in which each leaf node corresponds to a collection of pages for books in the same class (e.g., books on Data Mining). Users can browse to this class by following a path from the root to the corresponding leaf node, such as "Computers & Internet - Databases - Storage - Data Mining". Business applications often need to submit aggregation queries on such data, such as "find the average price of books on Data Mining". However, computing the exact answer to such a query is expensive because of the large amount of data, its dynamic nature, and limited Web-access resources. In this paper, we study how to answer such aggregation queries approximately, with quality guarantees, using sampling. We study adaptive-sampling techniques that allocate resources based on partial samples retrieved from different nodes in the hierarchy. Using statistical methods, we study how to estimate the quality of the answer from the sample. Our experimental study on real and synthetic data sets validates the proposed techniques.
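
The general idea of allocating a sampling budget adaptively across leaf nodes can be illustrated with a minimal Python sketch. This is not the algorithm from the paper, whose details the abstract does not give. It is a textbook-style stratified estimator that, after a small pilot sample per leaf, repeatedly sends the next batch of samples to the leaf currently contributing most to the variance of the estimate, and then reports a normal-approximation confidence interval. The function name `adaptive_stratified_mean`, its parameters (`pilot`, `budget`, `batch`, `z`), and the in-memory lists standing in for Web pages are all assumptions made for illustration.

```python
import math
import random
import statistics


def adaptive_stratified_mean(leaves, pilot=5, budget=60, batch=10, z=1.96, seed=0):
    """Estimate the mean over all items in `leaves` (dict: leaf name -> list of values).

    Hypothetical sketch: every leaf must be non-empty, and reading one value
    stands in for fetching one Web page.
    Returns (estimate, half_width, samples_taken_per_leaf).
    """
    rng = random.Random(seed)
    # Shuffle each leaf once; a prefix of the shuffled list is then a simple
    # random sample without replacement.
    pools = {name: rng.sample(vals, len(vals)) for name, vals in leaves.items()}
    total = sum(len(p) for p in pools.values())
    taken = {name: min(pilot, len(p)) for name, p in pools.items()}

    def var_term(name):
        # Contribution of one leaf to the variance of the stratified estimate,
        # including the finite-population correction (1 - n / size).
        n, size = taken[name], len(pools[name])
        s2 = statistics.variance(pools[name][:n]) if n > 1 else 0.0
        return (size / total) ** 2 * (s2 / n) * (1 - n / size)

    used = sum(taken.values())
    while used < budget:
        open_leaves = [name for name in pools if taken[name] < len(pools[name])]
        if not open_leaves:
            break  # every leaf has been read completely
        # Adaptive step: spend the next batch where the uncertainty is largest.
        target = max(open_leaves, key=var_term)
        step = min(batch, budget - used, len(pools[target]) - taken[target])
        taken[target] += step
        used += step

    # Weight each leaf's sample mean by the leaf's share of all items.
    estimate = sum(
        len(p) / total * statistics.mean(p[: taken[name]])
        for name, p in pools.items()
    )
    half_width = z * math.sqrt(sum(var_term(name) for name in pools))
    return estimate, half_width, taken


if __name__ == "__main__":
    gen = random.Random(42)
    prices = {
        "Data Mining": [gen.uniform(20, 120) for _ in range(300)],
        "Storage": [gen.uniform(40, 60) for _ in range(300)],
    }
    est, hw, used_per_leaf = adaptive_stratified_mean(prices)
    print(f"estimated average price: {est:.2f} +/- {hw:.2f}")
    print("samples per leaf:", used_per_leaf)
```

In this sketch a leaf whose pilot sample shows little variation receives few further samples, while a leaf with widely varying values absorbs most of the budget. The half-width relies on a normal approximation and on the leaf sizes being known, both of which are simplifying assumptions.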