Partial Collection Replication for Information Retrieval

Authors:
Zhihong Lu; Kathryn S. McKinley
Affiliations:
AT&T Laboratories, 200 Laurel Avenue, Middletown, New Jersey 07748, USA. zhihonglu@att.com; Department of Computer Sciences, University of Texas, Austin, Texas 78712, USA.
Venue:
Information Retrieval
Year:
2003

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from one or more larger collections and the corresponding inference network search mechanism. For each query, the distributed system determines whether a partial replica is a good match and, if so, searches it; otherwise, it searches the original collection. We demonstrate the scenarios in which partial replication performs better than systems that use caches, which store only previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries that are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system, comparing it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show that a hybrid system with caches and replicas performs better than either on its own.
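The query routing described in the abstract (exact-match cache first, then a partial replica if it is a good match, otherwise the original collection) might be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: the `Replica`, `Router`, and `search` names are hypothetical, the term-overlap `score` is a simple stand-in for the inference network replica selection function, and the `threshold` value is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical partial replica: a subset of documents plus its vocabulary."""
    name: str
    docs: dict[str, str]  # doc_id -> text
    vocab: set[str] = field(init=False)

    def __post_init__(self) -> None:
        self.vocab = {t for text in self.docs.values() for t in text.lower().split()}

    def score(self, query: str) -> float:
        # Stand-in selection score (an assumption, NOT the inference network
        # function from the paper): fraction of query terms the replica knows.
        terms = set(query.lower().split())
        return len(terms & self.vocab) / len(terms) if terms else 0.0


def search(docs: dict[str, str], query: str) -> list[str]:
    # Toy search: sorted ids of documents that contain every query term.
    terms = set(query.lower().split())
    return sorted(d for d, text in docs.items() if terms <= set(text.lower().split()))


class Router:
    """Hybrid routing: exact-match cache, then replicas, then the full collection."""

    def __init__(self, collection: dict[str, str], replicas: list[Replica],
                 threshold: float = 1.0) -> None:
        self.collection = collection
        self.replicas = replicas      # assumed ordered smallest first (a hierarchy)
        self.threshold = threshold    # arbitrary cut-off for "good match"
        self.cache: dict[str, list[str]] = {}

    def route(self, query: str) -> tuple[str, list[str]]:
        # A cache only helps when the query string matches exactly.
        if query in self.cache:
            return "cache", self.cache[query]
        for replica in self.replicas:
            if replica.score(query) >= self.threshold:
                source, result = replica.name, search(replica.docs, query)
                break
        else:
            # No replica is a good match: fall back to the original collection.
            source, result = "collection", search(self.collection, query)
        self.cache[query] = result
        return source, result
```

In this sketch, two distinct queries with the same answer (for example `"bill tax"` and `"tax bill"`) both miss the exact-match cache on first use but can both be served by the replica, which mirrors the abstract's point that replicas capture locality that caching alone cannot.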