Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information about such a corpus, but have no way to access it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface. Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., its size, topic, and special terms used), essentially leading to a chicken-and-egg problem. We develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch for the corpus.