Estimating corpus size via queries

Authors:
Andrei Broder;Marcus Fontoura;Vanja Josifovski;Ravi Kumar;Rajeev Motwani;Shubha Nabar;Rina Panigrahy;Andrew Tomkins;Ying Xu
Affiliations:
Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Yahoo! Research, Sunnyvale, CA;Stanford University, Stanford, CA
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased, low-variance estimator that closely approximates the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance.

Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms.

Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
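The idea of a query-pool estimator can be illustrated with a minimal sketch. It assumes a toy in-memory corpus (each document is a set of terms) standing in for a real query interface, and single-term queries. The function names `query_weight`, `estimate_covered_size`, and `corpus_size_from_two_pools` are hypothetical and do not come from the paper. The weighting used here, where each returned document contributes the reciprocal of the number of pool queries it matches, is one way to obtain an unbiased estimate and is an assumption of this sketch, as is the independence assumption behind the two-pool combination.

```python
import random


def query_weight(corpus, pool, query):
    """Weighted result count for one query (hypothetical helper).

    Each matching document contributes 1 / (number of pool queries it
    matches), so summing this value over the whole pool counts every
    covered document exactly once.
    """
    weight = 0.0
    for doc in corpus:
        if query in doc:  # stand-in for issuing the query to a search interface
            matches = sum(1 for q in pool if q in doc)
            weight += 1.0 / matches
    return weight


def estimate_covered_size(corpus, pool, num_samples, rng):
    """Estimate how many documents match at least one query in `pool`.

    Samples queries uniformly from the pool and scales the mean weight
    by the (known) pool size.
    """
    pool = list(pool)
    total = sum(query_weight(corpus, pool, rng.choice(pool))
                for _ in range(num_samples))
    return len(pool) * total / num_samples


def corpus_size_from_two_pools(size_a, size_b, size_ab):
    """Combine two pool estimates, assuming the pools are independent.

    If matching pool A and matching pool B are independent events, then
    size_ab / N = (size_a / N) * (size_b / N), hence N = size_a * size_b / size_ab.
    """
    return size_a * size_b / size_ab


# Toy corpus: five documents, four of which match the pool {"a", "b", "c"}.
docs = [
    frozenset({"a", "b"}),
    frozenset({"b"}),
    frozenset({"c"}),
    frozenset({"d"}),
    frozenset({"a", "c", "e"}),
]
pool = ["a", "b", "c"]
estimate = estimate_covered_size(docs, pool, 200, random.Random(0))
```

In this sketch, averaging `query_weight` over every query in the pool recovers the covered-set size exactly; sampling only some queries trades that exactness for fewer calls to the interface. How well the two-pool combination works depends on how close the pools are to independent, which matches the dependence on correlation noted in the abstract.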