A large part of the data on the World Wide Web resides in the deep web. Executing structured, high-level queries on deep web data sources involves a number of challenges, several of which arise because query execution engines have very limited access to the data. In this paper, we consider the problem of executing aggregation queries that involve data enumeration on these data sources, which requires sampling. The existing work in this area (HDSampler and its variants) is based on simple random sampling. We observe that this approach cannot obtain good estimates when the data is skewed. While there has been a lot of work on sampling skewed data, the existing methods rely on prior knowledge of the data and are therefore not applicable to hidden databases. We present two prior-knowledge-free sampling algorithms, Adaptive Neighborhood Sampling (ANS) and Two Phase adaptive Sampling (TPS), which allow an aggregation query to be answered with high accuracy, even when the data is skewed, and at a low sampling cost. For this purpose, we have developed robust estimators for aggregation functions including AVG, MAX, and MIN. Our experiments show that for data with moderate or large skew, ANS and TPS yield more accurate estimates, outperforming HDSampler by a factor of 4 on average. Even when the data has only a small skew, TPS retains an important advantage: it incurs only one-third of the sampling cost of HDSampler.
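The observation that simple random sampling struggles on skewed data can be illustrated with a small simulation. The sketch below is not ANS, TPS, or HDSampler, and it does not reproduce the paper's experiments. It assumes a purely synthetic setup: the population sizes, value ranges, the 1% share of large values, and the helper names `srs_avg_relative_error` and `make_populations` are all hypothetical choices made for illustration.

```python
import random
import statistics


def srs_avg_relative_error(population, sample_size, trials, seed=0):
    """Mean relative error of an AVG estimate under simple random sampling.

    Hypothetical helper: draws `trials` samples without replacement and
    compares each sample mean with the true population mean.
    """
    rng = random.Random(seed)
    true_avg = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # without replacement
        estimate = statistics.fmean(sample)
        errors.append(abs(estimate - true_avg) / abs(true_avg))
    return statistics.fmean(errors)


def make_populations(n=10_000, seed=1):
    """Build two synthetic populations (assumed, not taken from the paper)."""
    rng = random.Random(seed)
    # Low skew: values spread evenly over a narrow range.
    uniform = [rng.uniform(50, 150) for _ in range(n)]
    # High skew: about 99% small values and about 1% very large values.
    skewed = [
        rng.uniform(0, 10) if rng.random() < 0.99 else rng.uniform(5_000, 15_000)
        for _ in range(n)
    ]
    return uniform, skewed


if __name__ == "__main__":
    uniform, skewed = make_populations()
    for name, population in (("uniform", uniform), ("skewed", skewed)):
        err = srs_avg_relative_error(population, sample_size=100, trials=200)
        print(f"{name}: mean relative error of AVG = {err:.3f}")
```

In this setup a sample of 100 records often contains none or only a few of the rare large values, so the AVG estimate on the skewed population swings widely from one sample to the next, while the estimate on the uniform population stays close to the true mean.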