Evaluating sampling methods for uncooperative collections

Authors:
Paul Thomas; David Hawking
Affiliations:
Australian National University, Canberra, Australia; CSIRO ICT Centre, Canberra, Australia
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Citing 12
Cited 16

Searching distributed collections with inference networks

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Cluster-based language models for distributed retrieval

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
The impact of database selection on distributed searching

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Discovering the representative of a search engine

Proceedings of the tenth international conference on Information and knowledge management
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Relevant document distribution estimation method for resource selection

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
The indexable web is more than 11.5 billion pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Random sampling from a search engine's index

Proceedings of the 15th international conference on World Wide Web
Capturing collection size for distributed non-cooperative retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating corpus size via queries

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

Generalising multiple capture-recapture to non-uniform sample sizes

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Integral based source selection for uncooperative distributed information retrieval environments

Proceedings of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval
Robust result merging using sample-based score estimates

ACM Transactions on Information Systems (TOIS)
A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval

ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Simple Adaptations of Data Fusion Algorithms for Source Selection

ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Server selection methods in personal metasearch: a comparative empirical study

Information Retrieval
PISA: Federated Search in P2P Networks with Uncooperative Peers

DEXA '09 Proceedings of the 20th International Conference on Database and Expert Systems Applications
Exploiting peer relations for distributed multimedia information retrieval

ICME'09 Proceedings of the 2009 IEEE international conference on Multimedia and Expo
Estimating deep web data source size by capture-recapture method

Information Retrieval
Collection-integral source selection for uncooperative distributed information retrieval environments

Information Sciences: an International Journal
Ranking bias in deep web size estimation using capture recapture method

Data & Knowledge Engineering
Modeling information sources as integrals for effective and efficient source selection

Information Processing and Management: an International Journal
Federated Search

Foundations and Trends in Information Retrieval
To what problem is distributed information retrieval the solution?

Journal of the American Society for Information Science and Technology
Vertical selection in the information domain of children

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Merging algorithms for enterprise search

Proceedings of the 18th Australasian Document Computing Symposium

Abstract

Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters. We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.
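
As a rough illustration of why sampling through a query interface may favour longer documents, the following minimal Python sketch builds a toy collection and samples it with random single-term queries. It is not one of the methods evaluated in the paper: the collection, the `search` interface, and all names and parameters are hypothetical and chosen only for illustration.

```python
import random


def build_collection(rng, n_docs=200, vocab_size=500):
    """Toy collection (an assumption): each document is a set of term ids."""
    docs = []
    for _ in range(n_docs):
        length = rng.choice([5, 20, 80])  # short, medium, and long documents
        docs.append(frozenset(rng.randrange(vocab_size) for _ in range(length)))
    return docs


def search(docs, term, k=10):
    """Hypothetical search facility: ids of up to k documents containing the term."""
    return [i for i, d in enumerate(docs) if term in d][:k]


def query_based_sample(docs, rng, n_queries=300, vocab_size=500, k=10):
    """Issue random single-term queries and keep one random result per query."""
    sample = []
    for _ in range(n_queries):
        hits = search(docs, rng.randrange(vocab_size), k)
        if hits:
            sample.append(rng.choice(hits))
    return sample


rng = random.Random(0)
docs = build_collection(rng)
sample = query_based_sample(docs, rng)

# Compare the mean number of distinct terms per document in the whole
# collection against the mean in the query-based sample.
mean_len_all = sum(len(d) for d in docs) / len(docs)
mean_len_sample = sum(len(docs[i]) for i in sample) / len(sample)
print(f"collection mean length: {mean_len_all:.1f}")
print(f"sample mean length:     {mean_len_sample:.1f}")
```

In this toy setting a longer document contains more distinct terms, so it matches more queries and is drawn more often than a short one, which pushes the sample's mean length above the collection's.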