Partial Collection Replication for Information Retrieval

  • Authors:
  • Z. Lu;K. S. McKinley

  • Venue:
  • Partial Collection Replication for Information Retrieval
  • Year:
  • 1999

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we investigate using partial replication to search a terabyte of text in our distributed IR system. We use a replica selection database to direct queries to relevant replicas in a way that maintains query effectiveness while restricting some searches to a small percentage of the data, which improves performance and scalability and reduces network latency. We first investigate query locality with respect to time and replica size using real logs from THOMAS and Excite. Our evidence indicates that there is sufficient query locality to justify partial replication for information retrieval, and that partial replication can achieve better performance than caching queries, because the replica selection algorithm finds similarity between non-identical queries and thus increases observed locality. We then extend the inference network model to rank and select partial replicas. We compare our new selection algorithm to previous work on collection selection over a range of tuning parameters. For a given query, our replica selection algorithm correctly determines the most relevant of the replicas or original collection, and thus maintains the highest retrieval effectiveness while searching the least amount of data compared with the other ranking functions. We use a validated simulator to report on the performance of partial collection replication as a function of locality. We compare collection partitioning to partial replication with load balancing, and find that partial replication is much more effective at decreasing query response time than partitioning, even with significantly fewer resources, and that it requires only modest query locality. We also demonstrate an average query response time under 10 seconds for a variety of workloads with partial replication on a terabyte of text.
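
As a rough illustration of what measuring query locality from a log can look like, the sketch below computes the fraction of later queries that exactly match one of the most frequent earlier queries. This is the exact-match baseline that a query cache would exploit, not the replica selection algorithm of the paper, which also matches non-identical queries. The function name, the normalization step, and the sample queries are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def exact_match_locality(training_queries, test_queries, top_k):
    """Fraction of test queries that exactly match one of the top_k most
    frequent training queries, after a simple normalization.

    Assumption: exact string matching stands in for a query cache; it is
    not the similarity-based replica selection described in the abstract.
    """
    def normalize(query):
        # Hypothetical normalization: lowercase and collapse whitespace.
        return " ".join(query.lower().split())

    counts = Counter(normalize(q) for q in training_queries)
    hot_queries = {q for q, _ in counts.most_common(top_k)}

    if not test_queries:
        return 0.0
    hits = sum(1 for q in test_queries if normalize(q) in hot_queries)
    return hits / len(test_queries)


# Made-up sample data, not taken from the THOMAS or Excite logs.
earlier = ["tax reform", "Tax  Reform", "tax reform",
           "health care", "health care", "budget"]
later = ["tax reform", "education", "health care", "TAX reform"]

locality = exact_match_locality(earlier, later, top_k=2)
print(locality)
```

Varying `top_k` gives a crude analogue of varying replica size: a larger set of retained queries can only keep or raise the measured locality on the same log.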