What do we know about the Metropolis algorithm?
Journal of Computer and System Sciences
A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding information on the World Wide Web: the retrieval effectiveness of search engines
Information Processing and Management: an International Journal
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Measuring Search Engine Quality
Information Retrieval
Approximating Aggregate Queries about Web Pages via Random Walks
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Automatic performance evaluation of web search engines
Information Processing and Management: an International Journal
Sampling search-engine results
WWW '05 Proceedings of the 14th international conference on World Wide Web
The indexable web is more than 11.5 billion pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Estimating corpus size via queries
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient search engine measurements
Proceedings of the 16th international conference on World Wide Web
Measuring semantic similarity between words using web search engines
Proceedings of the 16th international conference on World Wide Web
Search engines and their public interfaces: which apis are the most synchronized?
Proceedings of the 16th international conference on World Wide Web
NL sampler: random sampling of web documents based on natural language with query hit estimation
Proceedings of the 2007 ACM symposium on Applied computing
A random walk approach to sampling hidden databases
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Factors affecting website reconstruction from the web infrastructure
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Agreeing to disagree: search engines and their public interfaces
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Evaluating sampling methods for uncooperative collections
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating collection size with logistic regression
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Using neighbors to date web documents
Proceedings of the 9th annual ACM international workshop on Web information and data management
RankMass crawler: a crawler with high personalized pagerank coverage guarantee
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Distinct value estimation on peer-to-peer networks
Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments
Modelling and Mining of Networked Information Spaces
Algorithms and Models for the Web-Graph
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Efficient sampling of information in social networks
Proceedings of the 2008 ACM workshop on Search in social media
Robust result merging using sample-based score estimates
ACM Transactions on Information Systems (TOIS)
Proceedings of the 18th international conference on World Wide Web
Sitemaps: above and beyond the crawl of duty
Proceedings of the 18th international conference on World Wide Web
A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
On unbiased sampling for unstructured peer-to-peer networks
IEEE/ACM Transactions on Networking (TON)
A framework for describing web repositories
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Brahms: Byzantine resilient random membership sampling
Computer Networks: The International Journal of Computer and Telecommunications Networking
A Device Search Strategy Based on Connections History for Patient Monitoring
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Robust estimation of Google counts for social network extraction
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Web Observation from a User Perspective
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Estimating deep web data source size by capture-recapture method
Information Retrieval
Foundations and Trends in Information Retrieval
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Segmentation of search engine results for effective data-fusion
ECIR'07 Proceedings of the 29th European conference on IR research
Ranking bias in deep web size estimation using capture recapture method
Data & Knowledge Engineering
Foundations and Trends in Information Retrieval
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
On identifying academic homepages for digital libraries
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
A multi-collection latent topic model for federated search
Information Retrieval
A prediction model for web search hit counts using word frequencies
Journal of Information Science
Counting YouTube videos via random prefix sampling
Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference
An overview of Web search evaluation methods
Computers and Electrical Engineering
A framework for utilising usage trends in the crawling and indexing process of search engines
International Journal of Knowledge and Web Intelligence
On measuring the lexical quality of the web
Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
Aggregate suppression for enterprise search engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
To what problem is distributed information retrieval the solution?
Journal of the American Society for Information Science and Technology
Context similarity measure using Fuzzy Formal Concept Analysis
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Estimating sum by weighted sampling
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Improving relational similarity measurement using symmetries in proportional word analogies
Information Processing and Management: an International Journal
Size estimation of non-cooperative data collections
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Mining a search engine's corpus without a query pool
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Selecting queries from sample to crawl deep web data sources
Web Intelligence and Agent Systems
We revisit a problem introduced by Bharat and Broder almost a decade ago: how can we sample random pages from a search engine's index using only the engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from two well-documented biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random-walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability that this document was selected into the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well-known Monte Carlo simulation methods: rejection sampling, importance sampling, and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that, under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms have no significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
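The weight-correction step the abstract describes can be illustrated with rejection sampling: if each biased draw comes with a weight proportional to its selection probability, accepting a draw d with probability c / weight(d) (for a lower bound c on the weights) makes every accepted document equally likely. The sketch below is a minimal, self-contained illustration of that idea only, not the paper's full algorithm (which must estimate the weights through the search engine's public interface); the three-document "index" and its weights are hypothetical.

```python
import random
from collections import Counter

def rejection_sample_uniform(sampler, weight, c, n_accept, rng):
    # Accept a biased draw d with probability c / weight(d), where c is a
    # known lower bound on the weights.  Pr[output = d] is then proportional
    # to weight(d) * (c / weight(d)) = c, i.e. uniform over all documents.
    accepted = []
    while len(accepted) < n_accept:
        d = sampler()
        if rng.random() < c / weight(d):
            accepted.append(d)
    return accepted

# Hypothetical toy "index": three documents whose selection weights mimic
# the length/rank bias discussed in the abstract.
weights = {"a": 1.0, "b": 2.0, "c": 4.0}
rng = random.Random(42)
docs = list(weights)
biased = lambda: rng.choices(docs, weights=[weights[d] for d in docs], k=1)[0]

samples = rejection_sample_uniform(
    biased, weights.__getitem__, min(weights.values()), 3000, rng)
counts = Counter(samples)
print(counts)  # each document appears roughly equally often despite the bias
```

Rejection sampling wastes the rejected draws (here, document "c" is discarded three times out of four on average); importance sampling and Metropolis-Hastings, the other two methods the paper employs, reuse every draw at the cost of a more involved analysis.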