Random sampling from a search engine's index

Authors:
Ziv Bar-Yossef; Maxim Gurevich
Affiliations:
Technion and Google Haifa, Haifa, Israel; Technion, Haifa, Israel
Venue:
Journal of the ACM (JACM)
Year:
2008

Citing 19
Cited 20

Algorithms for random generation and counting: a Markov chain approach
Matrix computations (3rd ed.)
A Chernoff Bound for Random Walks on Expander Graphs

SIAM Journal on Computing
What do we know about the metropolis algorithm?

Journal of Computer and System Sciences
A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding information on the World Wide Web: the retrieval effectiveness of search engines

Information Processing and Management: an International Journal
Measuring index quality using random walks on the Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Measuring Search Engine Quality

Information Retrieval
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines

Marketing Science
Large Deviation Bounds for Markov Chains

Combinatorics, Probability and Computing
Automatic performance evaluation of web search engines

Information Processing and Management: an International Journal
Fastest Mixing Markov Chain on a Graph

SIAM Review
Sampling search-engine results

WWW '05 Proceedings of the 14th international conference on World Wide Web
The indexable web is more than 11.5 billion pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Estimating corpus size via queries

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient search engine measurements

Proceedings of the 16th international conference on World Wide Web
Agreeing to disagree: search engines and their public interfaces

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Google stemming mechanisms

Journal of Information Science
Automatic retrieval of similar content using search engine query interface

Proceedings of the 18th ACM conference on Information and knowledge management
Unbiased estimation of size and other aggregates over hidden web databases

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ranking bias in deep web size estimation using capture recapture method

Data & Knowledge Engineering
Estimating and sampling graphs with multidimensional random walks

IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Estimating sizes of social networks via biased sampling

Proceedings of the 20th international conference on World wide web
Estimating dyslexia in the web

Proceedings of the International Cross-Disciplinary Conference on Web Accessibility
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
How much of the web is archived?

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Efficient Search Engine Measurements

ACM Transactions on the Web (TWEB)
Sampling hidden objects using nearest-neighbor oracles

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Applying the user-over-ranking hypothesis to query formulation

ICTIR'11 Proceedings of the Third international conference on Advances in information retrieval theory
Candidate document retrieval for web-scale text reuse detection

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
OPAL: automated form understanding for the deep web

Proceedings of the 21st international conference on World Wide Web
Sampling online social networks by random walk

Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research
Estimating clustering coefficients and size of social networks via random walk

Proceedings of the 22nd international conference on World Wide Web
Current challenges in web crawling

ICWE'13 Proceedings of the 13th international conference on Web Engineering
Towards social data platform: automatic topic-focused monitor for twitter stream

Proceedings of the VLDB Endowment
The ontological key: automatically understanding and integrating forms to access the deep Web

The VLDB Journal — The International Journal on Very Large Data Bases
On estimating the average degree

Proceedings of the 23rd international conference on World wide web

Abstract

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis-Hastings algorithm, and the Maximum Degree method. The limited access to search engines forces our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
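The idea of turning weighted, biased samples into near-uniform ones can be illustrated with plain rejection sampling. The sketch below is not the authors' algorithm: it assumes a toy four-document corpus, a hypothetical sampler `draw_biased` that picks a document with probability proportional to its length (mimicking the long-document bias mentioned above), and exact bias weights, whereas the abstract notes that real weights are only approximate. The names `rejection_sample`, `draw_biased`, and `lengths` are made up for illustration.

```python
import random
from collections import Counter


def rejection_sample(draw_biased, bias, min_bias, rng):
    """Return one sample that is uniform over the sampler's support.

    draw_biased: callable returning an item with probability proportional to bias(item)
    bias:        callable giving the (assumed exact) bias weight of an item
    min_bias:    a lower bound on bias(item) over all items
    """
    while True:
        x = draw_biased()
        # Accept with probability inversely proportional to the bias,
        # so over-sampled items are rejected more often.
        if rng.random() < min_bias / bias(x):
            return x


# Toy corpus (an assumption): document id -> length in terms.
lengths = {"a": 1, "b": 2, "c": 5, "d": 10}
docs = list(lengths)
rng = random.Random(0)


def draw_biased():
    # Hypothetical biased sampler: favors long documents.
    return rng.choices(docs, weights=[lengths[d] for d in docs], k=1)[0]


samples = [
    rejection_sample(draw_biased, lengths.__getitem__, min(lengths.values()), rng)
    for _ in range(20000)
]
counts = Counter(samples)
freqs = {d: counts[d] / len(samples) for d in docs}
print(freqs)  # each frequency should be close to 1/4
```

In this toy setting the expected acceptance rate is 4/18, so roughly four to five biased draws are spent per accepted sample. That cost is one reason importance sampling, which keeps every draw and reweights it instead, is listed as an alternative.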