A random walk approach to sampling hidden databases

Authors:
Arjun Dasgupta;Gautam Das;Heikki Mannila
Affiliations:
University of Texas at Arlington, Arlington, TX;University of Texas at Arlington, Arlington, TX;Helsinki University of Technology and University of Helsinki, Helsinki, Finland
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007

Citing 10

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Automatic discovery of language models for text databases

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Probe, count, and classify: categorizing hidden web databases

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Accurate estimation of the number of tuples satisfying a condition

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Crawling the Hidden Web

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
A bi-level Bernoulli scheme for database sampling

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective use of block-level sampling in statistics estimation

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Random sampling from a search engine's index

Proceedings of the 15th international conference on World Wide Web

Cited 24

Mining search engine query logs via suggestion sampling

Proceedings of the VLDB Endowment
Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Privacy preservation of aggregates in hidden databases: why and how?

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HDSampler: revealing data behind web form interfaces

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
A Device Search Strategy Based on Connections History for Patient Monitoring

IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Privacy risks in health databases from aggregate disclosure

Proceedings of the 2nd International Conference on PErvasive Technologies Related to Assistive Environments
Turbo-charging hidden database samplers with overflowing queries and skew reduction

Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HengHa: data harvesting detection on hidden databases

Proceedings of the 2010 ACM workshop on Cloud computing security
An access cost-aware approach for object retrieval over multiple sources

Proceedings of the VLDB Endowment
Approximate content summary for database selection in deep web data integration

WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Effective and efficient sampling methods for deep web aggregation queries

Proceedings of the 14th International Conference on Extending Database Technology
Just-in-time analytics on large file systems

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Attribute domain discovery for hidden web databases

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient Search Engine Measurements

ACM Transactions on the Web (TWEB)
Sampling hidden objects using nearest-neighbor oracles

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Effective stratification for low selectivity queries on deep web data sources

Proceedings of the 20th ACM international conference on Information and knowledge management
Stratified k-means clustering over a deep web data source

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Database Size Estimation by Query Performance -- A Complexity Aspect

UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing
Materialization of web data sources

Search Computing
Assessing relevance and trust of the deep web sources and results based on inter-source agreement

ACM Transactions on the Web (TWEB)
Mining a search engine's corpus without a query pool

Proceedings of the 22nd ACM international conference on information & knowledge management
Rank discovery from web databases

Proceedings of the VLDB Endowment

Abstract

A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
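
To make the idea in the abstract concrete, the following is a minimal sketch of a random walk over the query space with a rejection step. It is not the paper's exact algorithm; the abstract does not give those details. The assumptions are ours: the database holds distinct tuples of Boolean attributes, the interface returns at most k matches and signals overflow, the walk uses a fixed attribute order, and the acceptance rule is a standard rejection-sampling correction. The names TopKInterface, walk_once, and sample are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Row = Tuple[int, ...]


class TopKInterface:
    """Hypothetical stand-in for a web form: at most k matches plus an overflow flag."""

    def __init__(self, rows: List[Row], k: int) -> None:
        self.rows = list(rows)
        self.k = k

    def query(self, prefix: Row) -> Tuple[List[Row], bool]:
        # A query fixes the first len(prefix) attributes (fixed attribute order).
        matches = [r for r in self.rows if r[:len(prefix)] == prefix]
        return matches[: self.k], len(matches) > self.k


def walk_once(db: TopKInterface, n_attrs: int,
              rng: random.Random) -> Optional[Tuple[Row, float]]:
    """One walk down the attribute order.

    Returns (tuple, probability that this walk selects it), or None on a dead end.
    """
    prefix: Row = ()
    for depth in range(1, n_attrs + 1):
        # Extend the query with a uniformly random value for the next attribute.
        prefix = prefix + (rng.randint(0, 1),)
        results, overflow = db.query(prefix)
        if not results:
            return None  # empty answer: abandon this walk
        if not overflow:
            # Valid answer: pick one returned tuple uniformly.
            return rng.choice(results), (0.5 ** depth) / len(results)
    return None  # still overflowing at full depth (only possible with duplicates)


def sample(db: TopKInterface, n_attrs: int, n_samples: int,
           seed: int = 0) -> List[Row]:
    """Collect n_samples tuples; loops forever if the database is empty."""
    rng = random.Random(seed)
    # Smallest possible selection probability: deepest level, k tuples returned.
    floor = (0.5 ** n_attrs) / db.k
    out: List[Row] = []
    while len(out) < n_samples:
        hit = walk_once(db, n_attrs, rng)
        if hit is None:
            continue
        row, p = hit
        # Rejection step: tuples that are easy to reach are accepted less often,
        # so every tuple is emitted with the same probability per walk.
        if rng.random() < floor / p:
            out.append(row)
    return out
```

Under these assumptions, tuples reached by short walks are selected with higher probability than tuples deep in the query tree, and the rejection step cancels that skew at the cost of extra queries. A larger acceptance constant than the floor used here would reject less often but would leave some skew in the sample.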