Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Authors:
Mingyang Zhang; Nan Zhang; Gautam Das
Affiliations:
George Washington University, Washington, DC, USA; George Washington University, Washington, DC, USA; University of Texas at Arlington, Arlington, TX, USA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011


Abstract

Search engines over document corpora typically provide keyword-search interfaces. Examples include search engines over the web as well as those over enterprise and government websites. The corpus of such a search engine forms a rich source of information of analytical interest to third parties, but the only available access is by issuing search queries through its interface. To support data analytics over a search engine's corpus, one needs to address two main problems: the sampling of documents (for offline analytics) and the direct (online) estimation of aggregates, both while issuing only a small number of queries through the keyword-search interface. Existing work on sampling produces samples with unknown bias and may incur an extremely high query cost. Existing aggregate estimation techniques suffer from a similar problem, as the estimation error and query cost can both be large for certain aggregates. We propose novel techniques that produce unbiased samples as well as unbiased aggregate estimates with small variances, while incurring a query cost an order of magnitude smaller than that of existing techniques. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.