Unbiased estimation of size and other aggregates over hidden web databases

Authors:
Arjun Dasgupta;Xin Jin;Bradley Jewell;Nan Zhang;Gautam Das
Affiliations:
University of Texas at Arlington, Arlington, TX, USA;George Washington University, Washington, D.C., USA;University of Texas at Arlington, Arlington, TX, USA;George Washington University, Washington, D.C., USA;University of Texas at Arlington, Arlington, TX, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010



Abstract

Many websites provide restrictive form-like interfaces which allow users to execute search queries on the underlying hidden databases. In this paper, we consider the problem of estimating the size of a hidden database through its web interface. We propose novel techniques which use a small number of queries to produce unbiased estimates with small variance. These techniques can also be used for approximate query processing over hidden databases. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.