Load and storage balanced posting file partitioning for parallel information retrieval

  • Authors:
  • Yung-Cheng Ma;Chung-Ping Chung;Tien-Fu Chen

  • Affiliations:
  • Department of Computer Science and Information Engineering, Chang-Gung University, Kwei-Shan, Tao-Yuan, Taiwan;Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan;Department of Computer Science and Information Engineering, National Chung-Cheng University, Chiayi, Taiwan

  • Venue:
  • Journal of Systems and Software
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Abstract: Many recent major search engines on Internet use a large-scale cluster to store a large database and cope with high query arrival rate. To design a large scale parallel information retrieval system, both performance and storage cost has to be taken into integrated consideration. Moreover, a quantitative method to design the cluster in systematical way is required. This paper proposes posting file partitioning algorithm for these requirements. The partitioning follows the partition-by-document-ID principle to eliminate communication overhead. The kernel of the partitioning is a data allocation algorithm to allocate variable-sized data items for both load and storage balancing. The data allocation algorithm is proven to satisfy a load balancing constraint with asymptotical 1-optimal storage cost. A probability model is established such that query processing throughput can be calculated from keyword popularities and data allocation result. With these results, we show a quantitative method to design a cluster systematically. This research provides a systematical approach to large-scale information retrieval system design. This approach has the following features: (1) the differences to ideal load balancing and storage balancing are negligible in real-world application. (2) Both load balancing and storage balancing can be taken into integrated consideration without conflicting. (3) The data allocation algorithm is capable to deal with data items of variable-sizes and variable loads. An algorithm having all these features together is never achieved before and is the key factor for achieving load and storage balanced workstation cluster in a real-world environment.