Effective and efficient sampling methods for deep web aggregation queries

  • Authors: Fan Wang; Gagan Agrawal
  • Affiliations: The Ohio State University; The Ohio State University
  • Venue: Proceedings of the 14th International Conference on Extending Database Technology
  • Year: 2011

Abstract

A large part of the data on the World Wide Web resides in the deep web. Executing structured, high-level queries on deep web data sources involves a number of challenges, several of which arise because query execution engines have very limited access to the data. In this paper, we consider the problem of executing aggregation queries involving data enumeration on these data sources, which requires sampling. The existing work in this area (HDSampler and its variants) is based on simple random sampling. We observe that this approach cannot obtain good estimates when the data is skewed. While there has been a lot of work on sampling skewed data, the existing methods rely on prior knowledge of the data and are therefore not applicable to hidden databases. We present two prior-knowledge-free sampling algorithms, Adaptive Neighborhood Sampling (ANS) and Two Phase adaptive Sampling (TPS), which allow an aggregation query to be answered with high accuracy (even when the data is skewed) and at a low sampling cost. For this purpose, we have developed robust estimators for aggregation functions including AVG, MAX, and MIN. Our experiments show that for data with moderate or large skew, ANS and TPS yield more accurate estimates, outperforming HDSampler by a factor of 4 on average. Even when the data has only a small skew, TPS retains an important advantage: it incurs only one-third of the sampling cost of HDSampler.
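
To make the abstract's observation about skew concrete, the following is a minimal sketch of why simple random sampling can give unstable AVG estimates on skewed data. It does not implement HDSampler, ANS, or TPS, and it samples from an in-memory list rather than through a deep web query interface. The two populations, the sample size, the number of trials, and the function names `srs_avg_estimates` and `relative_spread` are all hypothetical choices made for illustration.

```python
import random
import statistics


def srs_avg_estimates(population, sample_size, trials, seed=0):
    """Estimate AVG repeatedly using simple random sampling without replacement."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]


def relative_spread(population, sample_size=50, trials=200, seed=0):
    """Standard deviation of the AVG estimates, relative to the true AVG."""
    true_avg = statistics.fmean(population)
    estimates = srs_avg_estimates(population, sample_size, trials, seed)
    return statistics.pstdev(estimates) / true_avg


# Hypothetical population with little skew: values spread evenly over 100..109.
low_skew = [100.0 + (i % 10) for i in range(10_000)]

# Hypothetical skewed population: 1% of the records carry almost all of the mass.
high_skew = [1.0] * 9_900 + [10_000.0] * 100

print("relative spread, low skew: ", relative_spread(low_skew))
print("relative spread, high skew:", relative_spread(high_skew))
```

In the skewed population most samples of 50 records contain none or only one of the rare large values, so the estimate swings widely from one sample to the next, while the same sample size gives a tight estimate on the low-skew population. This sketch illustrates only the problem; the paper's adaptive algorithms and estimators are not reproduced here.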