A random walk approach to sampling hidden databases

  • Authors:
  • Arjun Dasgupta;Gautam Das;Heikki Mannila

  • Affiliations:
  • University of Texas at Arlington, Arlington, TX;University of Texas at Arlington, Arlington, TX;Helsinki University of Technology and University of Helsinki, Helsinki, Finland

  • Venue:
  • Proceedings of the 2007 ACM SIGMOD international conference on Management of data
  • Year:
  • 2007

Abstract

A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to answer user queries. Generating a uniform random sample of this hidden database using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants in which the query space is viewed under a fixed or a random ordering of attributes. We also propose techniques that further improve sample quality through a probabilistic rejection-based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
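
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm; it assumes Boolean attributes, a fixed attribute ordering, distinct tuples, and a simulated top-k interface. The names query, walk_once, and sample, and the parameters k and c, are hypothetical. Restarting the walk on an empty result is a simplification chosen for brevity.

```python
import random

def query(db, constraints, k):
    """Simulated top-k form interface (hypothetical): returns (status, rows)."""
    matches = [t for t in db if all(t[i] == v for i, v in constraints.items())]
    if not matches:
        return "empty", []
    if len(matches) > k:
        return "overflow", matches[:k]  # interface shows only k rows
    return "valid", matches

def walk_once(db, m, k, rng):
    """One random walk down a fixed attribute order.

    Returns (row, prob), where prob is the probability that this walk
    selects that row, or None if the walk hits an empty query.
    """
    constraints = {}
    prob = 1.0
    for attr in range(m):
        constraints[attr] = rng.randint(0, 1)  # pick a random Boolean value
        prob *= 0.5
        status, rows = query(db, constraints, k)
        if status == "empty":
            return None  # assumption: simply restart instead of backtracking
        if status == "valid":
            return rng.choice(rows), prob / len(rows)
    return None  # still overflowing after all attributes were fixed

def sample(db, m, k, n, c, rng):
    """Collect n rows; accept each walk result with probability min(1, c / prob)."""
    out = []
    while len(out) < n:
        hit = walk_once(db, m, k, rng)
        if hit is None:
            continue
        row, prob = hit
        if rng.random() < min(1.0, c / prob):  # rejection step reduces skew
            out.append(row)
    return out

# Example: 3 Boolean attributes, interface returns at most k = 1 row.
db = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
samples = sample(db, m=3, k=1, n=20, c=2 ** -3, rng=random.Random(0))
```

Without the rejection step, the tuple (1, 1, 1) would be reached after a single attribute choice and would be heavily over-represented. In this sketch, setting c to a lower bound on prob (here 2^-m / k) makes every accepted row equally likely, at the cost of many rejected walks; a larger c trades some skew for fewer queries.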