A case study of distributed information retrieval architectures to index one terabyte of text

Authors:
Fidel Cacheda;Vassilis Plachouras;Iadh Ounis
Affiliations:
Department of Information and Communication Technologies, Facultad de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK
Venue:
Information Processing and Management: an International Journal
Year:
2005


Abstract

The growing number of documents to be indexed in many environments (Web, intranets, digital libraries) and the limitations of a single centralised index (lack of scalability, server overloading and failures) lead to the use of distributed information retrieval systems to search and locate the desired information efficiently. This work is a case study of different architectures for a distributed information retrieval system, intended as a guide for approximating the optimal architecture given a specific set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures by simulating a variable number of workstations (from 1 up to 4096). A collection of approximately 94 million documents and 1 terabyte (TB) of text is used to test the performance of the different architectures. In a purely distributed information retrieval system, the brokers become the bottleneck because of the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck because of the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a high number of query servers is used, essentially because of the reduced network load. However, a change in the distribution of the users' queries could reduce the performance of a clustered system.
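To make the broker's role concrete, the following is a minimal sketch of how a broker might combine the local answer sets returned by several query servers into one global ranking. It is an illustration only, not the simulation model used in the paper: the function name `merge_local_answer_sets`, the `(score, doc_id)` tuple layout, and the sample data are all assumptions, and scores are assumed to be directly comparable across servers.

```python
import heapq
from itertools import islice


def merge_local_answer_sets(local_answer_sets, k):
    """Merge per-server ranked lists into a global top-k list.

    Each local answer set is assumed to be a list of (score, doc_id)
    tuples already sorted by score in descending order.
    """
    # heapq.merge performs a lazy k-way merge; negating the score makes
    # the descending inputs look ascending, which is what merge expects.
    merged = heapq.merge(*local_answer_sets, key=lambda hit: -hit[0])
    # Only the first k merged hits are materialised.
    return list(islice(merged, k))


# Hypothetical results from three query servers.
server_results = [
    [(0.92, "d17"), (0.40, "d03")],
    [(0.88, "d52"), (0.75, "d61")],
    [(0.95, "d09"), (0.10, "d44")],
]

top3 = merge_local_answer_sets(server_results, k=3)
print(top3)  # [(0.95, 'd09'), (0.92, 'd17'), (0.88, 'd52')]
```

In this sketch the merge keeps one heap entry per query server, so the broker's work per query grows with the number of servers that answer. That is consistent with the abstract's observation that brokers become the bottleneck in a purely distributed system.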