Capturing collection size for distributed non-cooperative retrieval

Authors:
Milad Shokouhi;Justin Zobel;Falk Scholer;S. M. M. Tahaghoghi
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2006


Abstract

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
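
As an illustration of the kind of ecological technique the abstract alludes to, the following is a minimal sketch of the classical two-sample Lincoln-Petersen capture-recapture estimator, applied to document identifiers. It is not taken from the paper and should not be read as either of the authors' two proposed approaches. The function names (`lincoln_petersen`, `simulate`) and the simulation parameters are hypothetical, and the sketch assumes that both samples are drawn uniformly at random, which documents returned by a real search interface generally are not.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two samples of document ids.

    Classical estimator: N ~ (n1 * n2) / m, where n1 and n2 are the
    numbers of distinct ids in each sample and m is the number of ids
    seen in both. Assumes uniform, independent sampling.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        # With no overlap the estimator divides by zero.
        raise ValueError("no overlap between samples; estimate is undefined")
    return len(first) * len(second) / recaptured


def simulate(true_size=10_000, sample_size=500, seed=0):
    """Draw two uniform samples from a synthetic collection and estimate its size."""
    rng = random.Random(seed)
    docs = range(true_size)  # stand-in for the hidden collection
    first = rng.sample(docs, sample_size)
    second = rng.sample(docs, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f} (true size: 10000)")
```

In a non-cooperative setting the two samples would have to come from documents returned for probe queries rather than from direct access to the collection, so the uniformity assumption is the main point at which this sketch departs from practice.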