A case study of distributed information retrieval architectures to index one terabyte of text

  • Authors:
  • Fidel Cacheda;Vassilis Plachouras;Iadh Ounis

  • Affiliations:
  • Department of Information and Communication Technologies, Facultad de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2005

Abstract

The increasing number of documents to be indexed in many environments (Web, intranets, digital libraries) and the limitations of a single centralised index (lack of scalability, server overloading and failures) lead to the use of distributed information retrieval systems to efficiently search and locate the desired information. This work is a case study of different architectures for a distributed information retrieval system, intended as a guide to approximating the optimal architecture for a specific set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures by simulating a variable number of workstations (from 1 up to 4096). A collection of approximately 94 million documents and 1 terabyte (TB) of text is used to test the performance of the different architectures. In a purely distributed information retrieval system, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a high number of query servers is used, essentially due to the reduction of the network load. However, a change in the distribution of the users' queries could reduce the performance of a clustered system.
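To illustrate why the broker's work grows with the number of query servers, the following is a minimal sketch of the merge step, not the authors' implementation. It assumes each query server returns a local answer set already sorted by descending score; the names `Hit` and `merge_answer_sets` and the `(score, document id)` layout are hypothetical, introduced only for this example.

```python
import heapq
from itertools import islice
from typing import List, Tuple

# Hypothetical hit layout: (relevance score, document identifier).
Hit = Tuple[float, str]


def merge_answer_sets(local_sets: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-server ranked lists into a single global top-k list.

    Assumes every local list is sorted by descending score.
    """
    # heapq.merge expects ascending inputs, so the score is negated in the key.
    merged = heapq.merge(*local_sets, key=lambda hit: -hit[0])
    # Only the first k merged hits are consumed.
    return list(islice(merged, k))


if __name__ == "__main__":
    answer_sets = [
        [(0.9, "a"), (0.4, "b")],
        [(0.8, "c"), (0.1, "d")],
        [(0.7, "e")],
    ]
    print(merge_answer_sets(answer_sets, 3))
```

In this sketch the broker keeps one heap entry per query server, so each merged result costs time logarithmic in the number of servers, and every server's list must first arrive at the broker over the network.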