Estimating collection size with logistic regression

Authors:
Jingfang Xu;Sheng Wu;Xing Li
Affiliations:
Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

Collection size is an important feature of a collection's content summary and plays a vital role in collection selection for distributed search. In uncooperative environments, collection size estimation algorithms estimate the sizes of collections through their search interfaces. This paper proposes the heterogeneous capture (HC) algorithm, which models the capture probabilities of documents with logistic regression. Given these heterogeneous capture probabilities, the HC algorithm estimates collection size by conditional maximum likelihood. Experimental results on real web data show that our HC algorithm outperforms both the multiple capture-recapture and capture history algorithms.
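
To make the idea of heterogeneous capture probabilities concrete, the following is a minimal sketch of a generic conditional-likelihood capture-recapture estimator, not the paper's own formulation, which the abstract does not spell out. It assumes that each document has a single covariate `x` (hypothetical; for instance a length-like feature), that its probability of appearing in any one of `T` query samples is `sigmoid(a + b * x)`, and that the samples are independent. The parameters are fitted only on documents seen at least once, and the size estimate sums the inverse probability of being seen at least once over the captured documents. All function names and the simulated data are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_capture_model(xs, ys, T, lr=0.2, n_iter=500):
    """Fit p = sigmoid(a + b * x) by conditional maximum likelihood.

    xs: covariate of each captured document (assumed roughly unit scale).
    ys: number of samples, out of T, in which each document appeared (>= 1).
    Uses plain gradient ascent on the mean conditional log-likelihood.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            seen = 1.0 - (1.0 - p) ** T   # P(captured at least once)
            resid = y - T * p / seen      # y minus E[y | y >= 1]
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def estimate_size(xs, T, a, b):
    """Sum 1 / P(captured at least once) over the captured documents."""
    total = 0.0
    for x in xs:
        p = sigmoid(a + b * x)
        total += 1.0 / (1.0 - (1.0 - p) ** T)
    return total


def simulate(N, T, a, b, seed=0):
    """Simulate N documents and return only those captured at least once."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(N):
        x = rng.uniform(-1.0, 1.0)
        p = sigmoid(a + b * x)
        y = sum(rng.random() < p for _ in range(T))
        if y > 0:
            xs.append(x)
            ys.append(y)
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate(N=2000, T=5, a=-1.0, b=1.0)
    a_hat, b_hat = fit_capture_model(xs, ys, T=5)
    print("captured:", len(xs))
    print("estimated size:", round(estimate_size(xs, 5, a_hat, b_hat)))
```

Because every term in the sum is at least 1, the estimate can never fall below the number of distinct documents actually captured. The fixed step size is a simplification that assumes a small `T` and a covariate on roughly unit scale; a real implementation would more likely use a standard optimizer.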