What do we know about the Metropolis algorithm?
Journal of Computer and System Sciences
A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding information on the World Wide Web: the retrieval effectiveness of search engines
Information Processing and Management: an International Journal
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Measuring Search Engine Quality
Information Retrieval
Approximating Aggregate Queries about Web Pages via Random Walks
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Automatic performance evaluation of web search engines
Information Processing and Management: an International Journal
Sampling search-engine results
WWW '05 Proceedings of the 14th international conference on World Wide Web
The indexable web is more than 11.5 billion pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Estimating corpus size via queries
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient search engine measurements
Proceedings of the 16th international conference on World Wide Web
Measuring semantic similarity between words using web search engines
Proceedings of the 16th international conference on World Wide Web
Search engines and their public interfaces: which apis are the most synchronized?
Proceedings of the 16th international conference on World Wide Web
NL sampler: random sampling of web documents based on natural language with query hit estimation
Proceedings of the 2007 ACM symposium on Applied computing
A random walk approach to sampling hidden databases
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Factors affecting website reconstruction from the web infrastructure
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Agreeing to disagree: search engines and their public interfaces
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Evaluating sampling methods for uncooperative collections
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating collection size with logistic regression
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Using neighbors to date web documents
Proceedings of the 9th annual ACM international workshop on Web information and data management
RankMass crawler: a crawler with high personalized pagerank coverage guarantee
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Distinct value estimation on peer-to-peer networks
Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments
Modelling and Mining of Networked Information Spaces
Algorithms and Models for the Web-Graph
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Efficient sampling of information in social networks
Proceedings of the 2008 ACM workshop on Search in social media
Robust result merging using sample-based score estimates
ACM Transactions on Information Systems (TOIS)
Proceedings of the 18th international conference on World Wide Web
Sitemaps: above and beyond the crawl of duty
Proceedings of the 18th international conference on World Wide Web
A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
On unbiased sampling for unstructured peer-to-peer networks
IEEE/ACM Transactions on Networking (TON)
A framework for describing web repositories
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Brahms: Byzantine resilient random membership sampling
Computer Networks: The International Journal of Computer and Telecommunications Networking
A Device Search Strategy Based on Connections History for Patient Monitoring
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Robust estimation of Google counts for social network extraction
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Web Observation from a User Perspective
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Estimating deep web data source size by capture-recapture method
Information Retrieval
Foundations and Trends in Information Retrieval
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Segmentation of search engine results for effective data-fusion
ECIR'07 Proceedings of the 29th European conference on IR research
Ranking bias in deep web size estimation using capture recapture method
Data & Knowledge Engineering
Foundations and Trends in Information Retrieval
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
On identifying academic homepages for digital libraries
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
A multi-collection latent topic model for federated search
Information Retrieval
A prediction model for web search hit counts using word frequencies
Journal of Information Science
Counting YouTube videos via random prefix sampling
Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference
An overview of Web search evaluation methods
Computers and Electrical Engineering
A framework for utilising usage trends in the crawling and indexing process of search engines
International Journal of Knowledge and Web Intelligence
On measuring the lexical quality of the web
Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
Aggregate suppression for enterprise search engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
To what problem is distributed information retrieval the solution?
Journal of the American Society for Information Science and Technology
Context similarity measure using Fuzzy Formal Concept Analysis
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Estimating sum by weighted sampling
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Improving relational similarity measurement using symmetries in proportional word analogies
Information Processing and Management: an International Journal
Size estimation of non-cooperative data collections
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Mining a search engine's corpus without a query pool
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Selecting queries from sample to crawl deep web data sources
Web Intelligence and Agent Systems
We revisit a problem introduced by Bharat and Broder almost a decade ago: how can we sample random pages from a search engine's index using only the engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from two well-documented biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random-walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability that this document was selected into the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well-known Monte Carlo simulation methods: rejection sampling, importance sampling, and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that, under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms have no significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
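The weight-correction step the abstract describes can be illustrated with rejection sampling: if each biased draw comes with a weight proportional to its selection probability, accepting a draw d with probability c / weight(d) (for a lower bound c on the weights) makes every accepted document equally likely. The sketch below is a minimal, self-contained illustration of that idea only, not the paper's full algorithm (which must estimate the weights through the search engine's public interface); the three-document "index" and its weights are hypothetical.

```python
import random
from collections import Counter

def rejection_sample_uniform(sampler, weight, c, n_accept, rng):
    # Accept a biased draw d with probability c / weight(d), where c is a
    # known lower bound on the weights.  Pr[output = d] is then proportional
    # to weight(d) * (c / weight(d)) = c, i.e. uniform over all documents.
    accepted = []
    while len(accepted) < n_accept:
        d = sampler()
        if rng.random() < c / weight(d):
            accepted.append(d)
    return accepted

# Hypothetical toy "index": three documents whose selection weights mimic
# the length/rank bias discussed in the abstract.
weights = {"a": 1.0, "b": 2.0, "c": 4.0}
rng = random.Random(42)
docs = list(weights)
biased = lambda: rng.choices(docs, weights=[weights[d] for d in docs], k=1)[0]

samples = rejection_sample_uniform(
    biased, weights.__getitem__, min(weights.values()), 3000, rng)
counts = Counter(samples)
print(counts)  # each document appears roughly equally often despite the bias
```

Rejection sampling wastes the rejected draws (here, document "c" is discarded three times out of four on average); importance sampling and Metropolis-Hastings, the other two methods the paper employs, reuse every draw at the cost of a more involved analysis.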