Performance driven multi-objective distributed scheduling for parallel computations

  • Authors:
  • Ankur Narang;Abhinav Srivastava;Naga Praveen Kumar Katta;Rudrapatna K. Shyamasundar

  • Affiliations:
  • IBM Research - India, New Delhi;IBM Research - India, New Delhi;IBM Research - India, New Delhi;Tata Institute of Fundamental Research, Mumbai

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the advent of many-core architectures and strong need for Petascale (and Exascale) performance in scientific domains and industry analytics, efficient scheduling of parallel computations for higher productivity and performance has become very important. Further, movement of massive amounts (Terabytes to Petabytes) of data is very expensive, which necessitates affinity driven computations. Therefore, distributed scheduling of parallel computations on multiple places 1 needs to optimize multiple performance objectives: follow affinity maximally and ensure efficient space, time and message complexity. Simultaneous consideration of these objectives makes distributed scheduling a particularly challenging problem. In addition, parallel computations have data dependent execution patterns which requires online scheduling to effectively optimize the computation orchestration as it unfolds. This paper presents an online algorithm for affinity driven distributed scheduling of multi-place 2 parallel computations. To optimize multiple performance objectives simultaneously, our algorithm uses a low time and message complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing. Theoretical analysis of the expected and probabilistic lower and upper bounds on time and message complexity of this algorithm has been provided. On multi-core clusters such as Blue Gene/P (MPP architecture) and Intel multicore cluster, we demonstrate performance close to the custom MPI+Pthreads code. Further, strong, weak and data (increasing input data size) scalability have been demonstrated on multi-core clusters. Using well known benchmarks, we demonstrate 16% to 30% performance gain as compared to Cilk [6] on multi-core Intel Xeon 5570 (NUMA) architecture. Detailed experimental analysis illustrates efficient space (main memory) utilization as well. To the best of our knowledge, this is the first time multi-objective affinity driven distributed scheduling algorithm has been designed, theoretically analyzed and experimentally evaluated in a multi-place setup for multi-core cluster architectures.