We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from a well-documented bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis-Hastings algorithm, and the Maximum Degree method. The limited access to search engines forces our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
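The idea of pairing each biased sample with a bias weight, and then using the weights to simulate near-uniform samples, can be illustrated with a minimal Python sketch. This is not the authors' lexicon-based or random walk algorithm: the toy corpus, the function names, and the length-proportional sampler are assumptions made for illustration only, and the weights here are exact rather than approximate. The sketch shows textbook rejection sampling and a self-normalized importance sampling estimate applied to a sampler that favors long documents.

```python
import random
from collections import Counter

# Hypothetical toy corpus: document id -> length. Not data from the article.
LENGTHS = {"a": 1, "b": 2, "c": 5, "d": 10}
DOCS = list(LENGTHS)


def draw_biased(rng):
    """Draw a document with probability proportional to its length
    (a stand-in for a sampler that favors long documents)."""
    return rng.choices(DOCS, weights=[LENGTHS[d] for d in DOCS], k=1)[0]


def bias_weight(doc):
    """Bias weight of a sample: here, exactly the document length."""
    return LENGTHS[doc]


def rejection_sample(rng, min_weight=1):
    """Rejection sampling: accept a biased draw with probability
    min_weight / weight, so over-represented documents are rejected
    more often and accepted draws are uniform over the corpus."""
    while True:
        doc = draw_biased(rng)
        if rng.random() < min_weight / bias_weight(doc):
            return doc


def importance_estimate(rng, predicate, num_samples):
    """Self-normalized importance sampling: estimate the fraction of
    documents satisfying `predicate` by weighting each biased draw
    by 1 / weight."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(num_samples):
        doc = draw_biased(rng)
        inverse_weight = 1.0 / bias_weight(doc)
        numerator += inverse_weight * (1.0 if predicate(doc) else 0.0)
        denominator += inverse_weight
    return numerator / denominator


def uniform_frequencies(num_samples=20000, seed=0):
    """Empirical frequency of each document among accepted samples."""
    rng = random.Random(seed)
    counts = Counter(rejection_sample(rng) for _ in range(num_samples))
    return {doc: counts[doc] / num_samples for doc in DOCS}


if __name__ == "__main__":
    print(uniform_frequencies())
    print(importance_estimate(random.Random(1), lambda d: LENGTHS[d] >= 5, 20000))
```

In this toy setting the accepted samples should appear with roughly equal frequency even though the underlying sampler draws document "d" ten times as often as document "a". Rejection sampling pays for uniformity with discarded draws, whereas importance sampling keeps every draw and corrects for the bias only in the final estimate.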