Mining a search engine's corpus without a query pool

Authors:
Mingyang Zhang;Nan Zhang;Gautam Das
Affiliations:
George Washington University, Washington, D.C., USA;George Washington University, Washington, D.C., USA;University of Texas at Arlington, Qatar Computing Research Institute, Arlington, USA
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Abstract

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no direct way of accessing it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface. Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., size, topic, special terms used), essentially leading to a chicken-and-egg problem. We develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch.
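
To make the query-pool dependence concrete, the following is a minimal sketch of a generic pool-based estimator on a toy in-memory corpus. It is not QG-SAMPLER or QG-ESTIMATOR, whose details the abstract does not give; it only illustrates the kind of baseline the abstract contrasts against. The corpus, the `search` function, and all other names are hypothetical, and the sketch assumes the interface returns every matching document (no top-k cutoff). Each draw picks a pool query uniformly, then a result uniformly, and weights the document by the inverse of its draw probability (a Hansen-Hurwitz style estimator).

```python
import random

# Toy stand-in for a hidden corpus (assumption): doc id -> (terms, numeric attribute).
CORPUS = {
    1: ({"flu", "fever"}, 10.0),
    2: ({"flu", "cough"}, 20.0),
    3: ({"fever"}, 30.0),
    4: ({"diet", "sugar"}, 40.0),
    5: ({"yoga"}, 50.0),
}


def search(query):
    """Hypothetical search interface: ids of all documents containing the term."""
    return sorted(d for d, (terms, _) in CORPUS.items() if query in terms)


def draw_probability(doc_id, pool):
    """Probability that one draw (uniform query, then uniform result) yields doc_id."""
    terms = CORPUS[doc_id][0]
    return sum(1.0 / len(search(q)) for q in pool if q in terms) / len(pool)


def estimate(pool, n_draws, rng):
    """Inverse-probability-weighted estimates of COUNT and SUM from n_draws draws."""
    count_est = sum_est = 0.0
    for _ in range(n_draws):
        results = search(rng.choice(pool))
        if not results:  # a query with no results contributes zero to both totals
            continue
        doc = rng.choice(results)
        p = draw_probability(doc, pool)
        count_est += 1.0 / p
        sum_est += CORPUS[doc][1] / p
    return count_est / n_draws, sum_est / n_draws


def expected_count(pool):
    """Exact expectation of the COUNT estimate, found by enumerating every draw."""
    total = 0.0
    for q in pool:
        results = search(q)
        for doc in results:
            total += (1.0 / len(pool)) * (1.0 / len(results)) / draw_probability(doc, pool)
    return total


good_pool = ["flu", "fever", "diet", "yoga"]  # every document matches some query
bad_pool = ["flu", "fever"]                   # misses documents 4 and 5
```

In this toy setting, `expected_count(good_pool)` equals the true corpus size of 5, while `expected_count(bad_pool)` equals 3: documents that no pool query recalls are invisible to the estimator, however many draws are made. That is the mismatch the abstract refers to, and choosing `good_pool` required knowing the corpus vocabulary in advance.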