Similarity-based document distribution for efficient distributed information retrieval

  • Authors:
  • Sven Herschel

  • Affiliations:
  • Humboldt-Universität zu Berlin, Berlin, Germany

  • Venue:
  • WISE'07 Proceedings of the 8th international conference on Web information systems engineering
  • Year:
  • 2007

Abstract

Performing information retrieval (IR) efficiently in a distributed environment is currently one of the main challenges in IR. Document representations are distributed among nodes in a manner that allows a query processing algorithm to efficiently direct queries to those nodes that contribute to the result. Existing term-based document distribution algorithms do not scale to large collections or many-term queries because they incur heavy network traffic during the distribution and query phases. We propose a novel algorithm for document distribution, namely distance-based document distribution. The distribution obtained by our algorithm allows any IR query to be answered effectively by contacting only a few nodes, independent of both document collection size and network size, thereby improving efficiency. We accomplish this by linearizing the information retrieval search space such that it reflects the ranking formula that will later be used for retrieval. Our experimental evaluation indicates that effective information retrieval can be accomplished efficiently in distributed networks.
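The abstract does not spell out how the search space is linearized, so the following is only a minimal illustrative sketch under stated assumptions, not the paper's algorithm. It assumes that documents and queries are sparse term-weight vectors, that the ranking formula is cosine similarity, and that each vector is mapped to a one-dimensional key by its similarity to a fixed reference vector. The names `linear_key`, `distribute`, `nodes_to_contact`, the `boundaries` list, and the `radius` parameter are all hypothetical. Nodes own contiguous key ranges, so a query only needs to reach the nodes whose ranges lie near the query's own key.

```python
import math
from bisect import bisect_right


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_key(vec, reference):
    # Assumed linearization: a position in [0, 1] given by the similarity
    # of the vector to a fixed reference vector.
    return cosine(vec, reference)


def node_for(key, boundaries):
    # boundaries holds the sorted split points between adjacent nodes,
    # so len(boundaries) + 1 nodes each own one contiguous key range.
    return bisect_right(boundaries, key)


def distribute(docs, reference, boundaries):
    """Assign every document id to the node that owns its key."""
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for doc_id, vec in docs.items():
        nodes[node_for(linear_key(vec, reference), boundaries)].append(doc_id)
    return nodes


def nodes_to_contact(query, reference, boundaries, radius=0.1):
    """Return the nodes whose ranges overlap [key - radius, key + radius]."""
    key = linear_key(query, reference)
    first = node_for(max(0.0, key - radius), boundaries)
    last = node_for(min(1.0, key + radius), boundaries)
    return list(range(first, last + 1))
```

In this sketch the number of nodes contacted depends only on `radius` and the range boundaries, not on how many documents are stored. A single reference vector is a coarse projection, since two dissimilar documents can receive the same key, so a practical scheme would need a linearization that preserves the ranking order more faithfully.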