Load and storage balanced posting file partitioning for parallel information retrieval

Authors:
Yung-Cheng Ma;Chung-Ping Chung;Tien-Fu Chen
Affiliations:
Department of Computer Science and Information Engineering, Chang-Gung University, Kwei-Shan, Tao-Yuan, Taiwan;Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan;Department of Computer Science and Information Engineering, National Chung-Cheng University, Chiayi, Taiwan
Venue:
Journal of Systems and Software
Year:
2011

Abstract

Many major search engines on the Internet use large-scale clusters to store large databases and cope with high query arrival rates. In designing a large-scale parallel information retrieval system, performance and storage cost must be considered together. Moreover, a quantitative method for designing the cluster systematically is required. This paper proposes a posting file partitioning algorithm that meets these requirements. The partitioning follows the partition-by-document-ID principle to eliminate communication overhead. The kernel of the partitioning is a data allocation algorithm that allocates variable-sized data items for both load and storage balancing. The data allocation algorithm is proven to satisfy a load balancing constraint with asymptotically 1-optimal storage cost. A probability model is established so that query processing throughput can be calculated from keyword popularities and the data allocation result. With these results, we present a quantitative method for designing a cluster systematically. This research provides a systematic approach to large-scale information retrieval system design. The approach has the following features: (1) The differences from ideal load balancing and ideal storage balancing are negligible in real-world applications. (2) Load balancing and storage balancing can be considered together without conflict. (3) The data allocation algorithm can handle data items of variable size and variable load. An algorithm with all of these features has not been achieved before, and it is the key factor in achieving a load- and storage-balanced workstation cluster in a real-world environment.
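To make the allocation problem concrete, the following is a minimal sketch of a plain greedy heuristic: items are taken in decreasing order of load, and each is placed on the node with the smallest load so far, with ties broken by storage used. It is not the algorithm proposed in the paper and carries none of its proven guarantees; `Item`, `allocate`, and the sample values are hypothetical and chosen only for illustration.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Item(NamedTuple):
    """A data item with a storage size and an expected access load (hypothetical type)."""
    name: str
    size: int
    load: int


def allocate(items: List[Item], num_nodes: int) -> Tuple[List[List[str]], List[Tuple[int, int]]]:
    """Greedy heuristic: heaviest-load item first, onto the least-loaded node.

    Returns the item names assigned to each node and the (load, size)
    totals per node. Illustrative only; not the paper's algorithm.
    """
    # Heap entries are (total_load, total_size, node_index), so the node
    # with the lowest load is popped first and storage breaks ties.
    heap = [(0, 0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_nodes)]

    for item in sorted(items, key=lambda it: it.load, reverse=True):
        load, size, node = heapq.heappop(heap)
        assignment[node].append(item.name)
        heapq.heappush(heap, (load + item.load, size + item.size, node))

    # Report totals in node order rather than heap order.
    totals = [(load, size) for load, size, _ in sorted(heap, key=lambda e: e[2])]
    return assignment, totals


# Sample data (assumed values): four items spread over two nodes.
items = [Item("a", 10, 5), Item("b", 20, 4), Item("c", 5, 3), Item("d", 15, 2)]
assignment, totals = allocate(items, 2)
print(assignment)
print(totals)
```

A heuristic like this balances load first and treats storage only as a tie-breaker, which is exactly the kind of trade-off the abstract says the proposed algorithm avoids by handling both constraints together.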