Distributed text retrieval from overlapping collections

  • Authors:
  • Milad Shokouhi;Justin Zobel;Yaniv Bernstein

  • Affiliations:
  • RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia

  • Venue:
  • ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple collections; answers to queries are produced by selecting the collections to query and then merging results from these collections. However, in most prior research in the area, collections are assumed to be disjoint. In this paper, we investigate the effectiveness of different combinations of server selection and result merging algorithms in the presence of duplicates. We also test our hash-based method for efficiently detecting duplicates and near-duplicates in the lists of documents returned by collections. Our results, based on two different designs of test data, indicate that some DIR methods are more likely to return duplicate documents, and show that removing such redundant documents can have a significant impact on the final search effectiveness.