Abstract: Many major Internet search engines use large-scale clusters to store large databases and to cope with high query arrival rates. Designing a large-scale parallel information retrieval system requires that performance and storage cost be considered together, and it calls for a quantitative method for designing the cluster systematically. This paper proposes a posting file partitioning algorithm that meets these requirements. The partitioning follows the partition-by-document-ID principle to eliminate communication overhead. The kernel of the partitioning is a data allocation algorithm that allocates variable-sized data items for both load balancing and storage balancing. The data allocation algorithm is proven to satisfy a load balancing constraint with asymptotically 1-optimal storage cost. A probability model is established so that query processing throughput can be calculated from keyword popularities and the data allocation result. With these results, we show a quantitative method for designing a cluster systematically. This research provides a systematic approach to large-scale information retrieval system design, with the following features: (1) the deviations from ideal load balancing and ideal storage balancing are negligible in real-world applications; (2) load balancing and storage balancing can be considered together without conflict; (3) the data allocation algorithm can handle data items of variable sizes and variable loads. No previous algorithm combines all of these features, and this combination is the key factor in achieving a load-balanced and storage-balanced workstation cluster in a real-world environment.
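The abstract does not give the allocation algorithm itself. To make the problem concrete, the following is a minimal sketch of a generic greedy heuristic that places items with both a size and a load onto servers while keeping both quantities balanced. It is not the paper's algorithm and carries none of its proven guarantees. The names `Server` and `greedy_allocate`, the `(name, size, load)` item format, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server record: accumulated load, storage, and item names."""
    load: float = 0.0
    storage: float = 0.0
    items: list = field(default_factory=list)


def greedy_allocate(items, num_servers):
    """Assign (name, size, load) items to servers with a simple greedy rule.

    Illustrative heuristic only (an assumption, not the paper's method):
    handle the heaviest items first, and put each item on the server whose
    larger normalized share (load or storage) stays smallest after placement.
    """
    servers = [Server() for _ in range(num_servers)]
    # Totals are used to put load and storage on a comparable 0..1 scale.
    total_size = sum(size for _, size, _ in items) or 1.0
    total_load = sum(load for _, _, load in items) or 1.0

    def weight(item):
        _, size, load = item
        return max(size / total_size, load / total_load)

    for name, size, load in sorted(items, key=weight, reverse=True):
        best = min(
            servers,
            key=lambda s: max((s.load + load) / total_load,
                              (s.storage + size) / total_size),
        )
        best.items.append(name)
        best.load += load
        best.storage += size
    return servers


if __name__ == "__main__":
    # Made-up posting-file fragments: (name, size, fraction of query load).
    sample = [("a", 4, 0.4), ("b", 3, 0.3), ("c", 2, 0.2), ("d", 1, 0.1)]
    for i, srv in enumerate(greedy_allocate(sample, 2)):
        print(i, srv.items, srv.load, srv.storage)
```

Normalizing both quantities by their totals lets a single comparison trade off load against storage. A heuristic like this can still leave a noticeable imbalance on skewed inputs, which is the gap the paper's algorithm is stated to close.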