Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Automatic discovery of language models for text databases
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Probe, count, and classify: categorizing hidden web databases
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Accurate estimation of the number of tuples satisfying a condition
SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes
Proceedings of the 27th International Conference on Very Large Data Bases
A bi-level Bernoulli scheme for database sampling
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Random sampling from a search engine's index
Proceedings of the 15th international conference on World Wide Web
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration
DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HDSampler: revealing data behind web form interfaces
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
A Device Search Strategy Based on Connections History for Patient Monitoring
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Privacy risks in health databases from aggregate disclosure
Proceedings of the 2nd International Conference on PErvasive Technologies Related to Assistive Environments
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HengHa: data harvesting detection on hidden databases
Proceedings of the 2010 ACM workshop on Cloud computing security workshop
An access cost-aware approach for object retrieval over multiple sources
Proceedings of the VLDB Endowment
Approximate content summary for database selection in deep web data integration
WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Effective and efficient sampling methods for deep web aggregation queries
Proceedings of the 14th International Conference on Extending Database Technology
Just-in-time analytics on large file systems
FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Attribute domain discovery for hidden web databases
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient Search Engine Measurements
ACM Transactions on the Web (TWEB)
Sampling hidden objects using nearest-neighbor oracles
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Effective stratification for low selectivity queries on deep web data sources
Proceedings of the 20th ACM international conference on Information and knowledge management
Stratified k-means clustering over a deep web data source
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Database Size Estimation by Query Performance -- A Complexity Aspect
UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing
Materialization of web data sources
Search Computing
Assessing relevance and trust of the deep web sources and results based on inter-source agreement
ACM Transactions on the Web (TWEB)
Mining a search engine's corpus without a query pool
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Rank discovery from web databases
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.