Evaluating the performance of distributed architectures for information retrieval using a variety of workloads

Authors:
Brendon Cahoon; Kathryn S. McKinley; Zhihong Lu
Affiliations:
Univ. of Massachusetts, Amherst, MA; Univ. of Massachusetts, Amherst, MA; Village Networks, Hazlet, NJ
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2000

Abstract

The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
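
The relationship between client command rate and response time described in the abstract can be illustrated with a toy queueing sketch. This is not the authors' simulator or the Inquery system: it assumes a single first-come-first-served broker, Poisson command arrivals, and a fixed service time per command. The function name `mean_response_time` and all parameter values are hypothetical, chosen only for illustration.

```python
import random


def mean_response_time(arrival_rate, service_time, n_commands=20000, seed=0):
    """Simulate one FIFO broker and return the mean response time (seconds).

    Assumptions (not from the article): commands arrive as a Poisson process
    with `arrival_rate` commands per second, and every command occupies the
    broker for exactly `service_time` seconds.
    """
    rng = random.Random(seed)
    clock = 0.0        # arrival time of the current command
    broker_free = 0.0  # time at which the broker has cleared its backlog
    total = 0.0
    for _ in range(n_commands):
        clock += rng.expovariate(arrival_rate)  # next arrival
        start = max(clock, broker_free)         # wait if the broker is busy
        broker_free = start + service_time
        total += broker_free - clock            # queueing delay plus service
    return total / n_commands


if __name__ == "__main__":
    # Hypothetical rates; the broker saturates at 1 / service_time = 10 per second.
    for rate in (2, 5, 8, 9.5, 12):
        print(rate, round(mean_response_time(rate, 0.1), 3))
```

In this sketch, response time rises gradually while the command rate stays below the broker's capacity and grows without bound once the rate exceeds it, which mirrors the distinction the abstract draws between graceful degradation and workloads that overwhelm the system.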