Collection size is an important component of a collection's content summary and plays a vital role in collection selection for distributed search. In uncooperative environments, where collections do not report their own statistics, collection size must be estimated through their search interfaces. This paper proposes the heterogeneous capture (HC) algorithm, which models the capture probability of each document with logistic regression. Given these heterogeneous capture probabilities, the HC algorithm estimates collection size by conditional maximum likelihood. Experimental results on real web data show that the HC algorithm outperforms both the multiple capture-recapture and capture history algorithms.
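The abstract does not give the HC formulas, so the following is only a minimal sketch of the general technique it names: logistic capture probabilities combined with a conditional likelihood and a Horvitz-Thompson style size estimate, as commonly used in capture-recapture statistics. The function names, the per-document feature vector, and the simplification that a document's capture probability is the same in every sample are assumptions made for illustration, not details taken from the paper. Fitting the coefficients (maximizing the conditional log-likelihood) is left out; the sketch assumes they are already available.

```python
import math


def capture_prob(features, coeffs, intercept):
    """Logistic model for the probability that one sample captures a document.

    `features` is a hypothetical per-document covariate vector
    (for example document length or rank); it is not specified by the abstract.
    """
    z = intercept + sum(b * x for b, x in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))


def conditional_log_likelihood(docs, coeffs, intercept, n_samples):
    """Log-likelihood conditioned on each document being captured at least once.

    `docs` is a list of (features, times_captured) pairs for the documents
    that appeared in at least one of the `n_samples` samples.
    """
    ll = 0.0
    for features, k in docs:
        p = capture_prob(features, coeffs, intercept)
        seen_at_least_once = 1.0 - (1.0 - p) ** n_samples
        ll += k * math.log(p) + (n_samples - k) * math.log(1.0 - p)
        ll -= math.log(seen_at_least_once)
    return ll


def estimate_size(docs, coeffs, intercept, n_samples):
    """Horvitz-Thompson style estimate: each observed document is weighted by
    the inverse of its probability of being captured at least once."""
    total = 0.0
    for features, _ in docs:
        p = capture_prob(features, coeffs, intercept)
        total += 1.0 / (1.0 - (1.0 - p) ** n_samples)
    return total


# Toy usage: three observed documents, two samples, all-zero coefficients,
# so every document has capture probability 0.5 in each sample.
observed = [([1.0], 1), ([2.0], 2), ([3.0], 1)]
size = estimate_size(observed, coeffs=[0.0], intercept=0.0, n_samples=2)
```

In this toy case each document is seen at least once with probability 0.75, so each observed document stands for 1/0.75 documents in the collection. Documents that are hard to capture (low probability) receive larger weights, which is how heterogeneity in capture probability enters the estimate.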