We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance.

Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms.

Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
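
To make the idea concrete, here is a minimal Python sketch of one way a pool-based estimator of this kind could work. The abstract does not spell out the construction, so everything below is an assumption for illustration: the toy `CORPUS`, the `search` stand-in for the query interface, and the function names `estimate_size` and `estimate_corpus_size` are all hypothetical. The sketch samples queries uniformly from the pool and weights each returned document by the reciprocal of the number of pool queries it matches, so that each matching document contributes one unit in expectation. The second function combines two term pools under the assumption that they are uncorrelated, in the style of a capture-recapture ratio.

```python
import random

# Toy corpus (hypothetical): document id -> set of terms it contains.
CORPUS = {
    0: {"apple", "red"},
    1: {"apple", "banana", "blue"},
    2: {"banana"},
    3: {"cherry", "red", "blue"},
    4: {"date"},
    5: {"red"},
}


def search(term):
    """Stand-in for the query interface: ids of documents containing `term`."""
    return [doc for doc, terms in CORPUS.items() if term in terms]


def estimate_size(pool, sampled_queries, keep=lambda doc: True):
    """Estimate how many documents match at least one pool query and pass `keep`.

    `sampled_queries` should be drawn uniformly at random from `pool`.
    """
    pool_set = set(pool)
    total = 0.0
    num_queries = 0
    for query in sampled_queries:
        num_queries += 1
        for doc in search(query):
            if keep(doc):
                # The document matches `query`, so this count is at least 1.
                matches = len(pool_set & CORPUS[doc])
                total += 1.0 / matches
    if num_queries == 0:
        raise ValueError("need at least one sampled query")
    # Scale the per-query average by the known pool size.
    return len(pool) * total / num_queries


def estimate_corpus_size(pool_a, pool_b, num_samples, rng):
    """Two-pool estimate, assuming the pools are roughly uncorrelated."""
    set_b = set(pool_b)
    queries_a = [rng.choice(pool_a) for _ in range(num_samples)]
    queries_b = [rng.choice(pool_b) for _ in range(num_samples)]
    size_a = estimate_size(pool_a, queries_a)
    size_b = estimate_size(pool_b, queries_b)
    # Documents reachable from pool A that also match some term in pool B.
    size_ab = estimate_size(
        pool_a, queries_a, keep=lambda doc: bool(set_b & CORPUS[doc])
    )
    if size_ab == 0:
        return None  # no overlap observed; the ratio is undefined
    return size_a * size_b / size_ab


POOL_A = ["apple", "banana", "cherry"]
POOL_B = ["red", "blue"]
estimate = estimate_corpus_size(POOL_A, POOL_B, 200, random.Random(0))
```

In this sketch, counting how many pool queries a document matches requires access to the document's terms; against a real search engine that would mean fetching and parsing each returned page, which the toy `CORPUS` lookup glosses over. Documents that match no term in either pool (document 4 here) are covered only through the independence assumption, which is why the accuracy of a two-pool estimate depends on how correlated the pools are.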