Sample sizes for query probing in uncooperative distributed information retrieval

Authors:
Milad Shokouhi; Falk Scholer; Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia; School of Computer Science and Information Technology, RMIT University, Melbourne, Australia; School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Venue:
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Year:
2006

Abstract

The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed only to those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing, in which randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
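To make the idea of query probing with a variable sample size concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify their sampling strategy or stopping rule. The `search` function stands in for a collection's search interface, and the stopping rule (halt once the last `window` probes together contribute fewer than `min_new_terms` unseen terms) is an assumption chosen for illustration, as are all names and parameter values.

```python
import random


def search(collection, term, k=4):
    """Hypothetical search interface: ids of up to k documents containing term."""
    return [i for i, doc in enumerate(collection) if term in doc.split()][:k]


def probe_sample(collection, seed_term, window=5, min_new_terms=1,
                 max_probes=100, seed=0):
    """Sample documents by probing with single-term queries.

    Assumed stopping rule (not taken from the paper): stop when the last
    `window` probes together add fewer than `min_new_terms` new terms.
    """
    rng = random.Random(seed)
    sampled, vocab, history = set(), set(), []
    query = seed_term
    probes = 0
    for probes in range(1, max_probes + 1):
        new_terms = 0
        for doc_id in search(collection, query):
            if doc_id in sampled:
                continue  # already downloaded; nothing new to learn
            sampled.add(doc_id)
            terms = set(collection[doc_id].split())
            new_terms += len(terms - vocab)
            vocab |= terms
        history.append(new_terms)
        if len(history) >= window and sum(history[-window:]) < min_new_terms:
            break  # vocabulary growth has stalled
        if not vocab:
            break  # the seed query matched nothing, so no terms to probe with
        # Next probe: a random term from the vocabulary learned so far.
        query = rng.choice(sorted(vocab))
    return sampled, vocab, probes


if __name__ == "__main__":
    docs = ["apple banana", "banana cherry", "cherry date", "egg fig"]
    ids, terms, n = probe_sample(docs, "apple")
    print(len(ids), "documents,", len(terms), "terms,", n, "probes")
```

Because the stopping point depends on how quickly new terms stop appearing, the number of sampled documents differs from collection to collection rather than being fixed in advance at a value such as 300.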