Effective and efficient sampling methods for deep web aggregation queries

Authors:
Fan Wang;Gagan Agrawal
Affiliations:
The Ohio State University;The Ohio State University
Venue:
Proceedings of the 14th International Conference on Extending Database Technology
Year:
2011



Abstract

A large part of the data on the World Wide Web resides in the deep web. Executing structured, high-level queries on deep web data sources involves a number of challenges, several of which arise because query execution engines have very limited access to the data. In this paper, we consider the problem of executing aggregation queries involving data enumeration on these data sources, which requires sampling. Existing work in this area (HDSampler and its variants) is based on simple random sampling, and we observe that this approach cannot obtain good estimates when the data is skewed. Although there has been considerable work on sampling skewed data, the existing methods rely on prior knowledge of the data and are therefore not applicable to hidden databases. We present two prior-knowledge-free sampling algorithms, Adaptive Neighborhood Sampling (ANS) and Two Phase adaptive Sampling (TPS), which allow an aggregation query to be answered with high accuracy (even when the data is skewed) and at a low sampling cost. For this purpose, we have developed robust estimators for aggregation functions including AVG, MAX, and MIN. Our experiments show that for data with moderate or large skew, ANS and TPS yield more accurate estimates, outperforming HDSampler by a factor of 4 on average. Even when the data has only a small skew, TPS retains an important advantage: it incurs only one-third of the sampling cost of HDSampler.
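
The observation that simple random sampling struggles on skewed data can be illustrated with a small simulation. The sketch below is not ANS, TPS, or HDSampler, and it does not model a deep web query interface. It only compares the relative error of an AVG estimate from a simple random sample on two synthetic populations, one without skew and one in which a small fraction of values is very large. The function names, value ranges, outlier fraction, and sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def make_population(n, outlier_fraction, seed):
    """Synthetic values (assumed for illustration): mostly small,
    with a given fraction of very large ones."""
    rng = random.Random(seed)
    return [
        rng.uniform(5_000, 10_000) if rng.random() < outlier_fraction
        else rng.uniform(0, 100)
        for _ in range(n)
    ]


def srs_avg(population, sample_size, rng):
    """Estimate AVG from a simple random sample drawn without replacement."""
    return statistics.fmean(rng.sample(population, sample_size))


def relative_rmse(population, sample_size, trials, seed):
    """Root-mean-square error of srs_avg over repeated trials,
    divided by the true AVG of the population."""
    rng = random.Random(seed)
    true_avg = statistics.fmean(population)
    squared_errors = [
        (srs_avg(population, sample_size, rng) - true_avg) ** 2
        for _ in range(trials)
    ]
    return statistics.fmean(squared_errors) ** 0.5 / true_avg


if __name__ == "__main__":
    flat = make_population(10_000, outlier_fraction=0.0, seed=0)
    skewed = make_population(10_000, outlier_fraction=0.02, seed=0)
    for name, pop in (("no skew", flat), ("skewed", skewed)):
        print(name, round(relative_rmse(pop, 100, 200, seed=1), 3))
```

With the same sample size, the relative error on the skewed population should come out several times larger than on the population without skew, because a sample of 100 often contains too few or too many of the rare large values. This is the kind of gap that motivates sampling schemes that adapt to the data rather than sampling uniformly.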