Evaluating the performance of distributed architectures for information retrieval using a variety of workloads

  • Authors:
  • Brendon Cahoon; Kathryn S. McKinley; Zhihong Lu

  • Affiliations:
  • Univ. of Massachusetts, Amherst, MA; Univ. of Massachusetts, Amherst, MA; Village Networks, Hazlet, NJ

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2000

Abstract

The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.