Adaptive and Virtual Reconfigurations for Effective Dynamic Job Scheduling in Cluster Systems

  • Authors:
  • Songqing Chen;Li Xiao;Xiaodong Zhang

  • Affiliations:
  • -;-;-

  • Venue:
  • ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
  • Year:
  • 2002

Quantified Score

Hi-index 0.02

Abstract

In a cluster system with dynamic load sharing support, a job submission or migration to a workstation is determined by the availability of CPU and memory resources on that workstation at the time [3]. In such a system, a small number of running jobs with unexpectedly large memory allocation requirements may significantly increase the queuing delays of the remaining jobs with normal memory requirements, slowing down the execution of individual jobs and decreasing system throughput. We call this phenomenon the job blocking problem because the big jobs block the execution pace of the majority of jobs in the cluster. Since the memory demands of jobs may not be known in advance and may change dynamically, the possibility that unsuitable job submissions/migrations cause the blocking problem is high, and the existing load sharing schemes are unable to handle this problem effectively. We propose a software method, incorporated with dynamic load sharing, which adaptively reserves a small set of workstations through virtual cluster reconfiguration to provide special services to the jobs demanding large memory allocations. This policy follows the principle of the shortest-remaining-processing-time policy. As soon as the blocking problem is resolved by the reconfiguration, the system adaptively switches back to the normal load sharing state. We present three contributions in this study. (1) We quantitatively present the conditions that cause the job blocking problem. (2) We present the adaptive software method in a dynamic load sharing system, and show that the adaptive process causes little additional overhead. (3) Conducting trace-driven simulations, we show that our method can effectively improve cluster computing performance by quickly resolving the job blocking problem. The effectiveness and performance insights are also analytically verified.
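The reserve-then-switch-back idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' algorithm: the `LARGE_MB` cutoff, the `Node` and `Cluster` names, the rule for choosing which node to reserve, and the simplification that reservation only affects where newly submitted jobs are routed (no migration) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

LARGE_MB = 400  # hypothetical cutoff between "normal" and "large" memory demand


@dataclass
class Node:
    name: str
    capacity_mb: int
    jobs: dict = field(default_factory=dict)  # job name -> memory demand in MB
    reserved: bool = False

    def free_mb(self):
        return self.capacity_mb - sum(self.jobs.values())


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def reconfigured(self):
        """True while at least one node is reserved for large jobs."""
        return any(n.reserved for n in self.nodes)

    def submit(self, name, mem_mb):
        """Place a job on the eligible node with the most free memory.

        Returns the chosen node, or None if the job has to wait.
        """
        large = mem_mb >= LARGE_MB
        if self.reconfigured():
            # Assumed routing rule: large jobs go only to reserved nodes,
            # normal jobs go only to unreserved nodes.
            pool = [n for n in self.nodes if n.reserved == large]
        else:
            pool = self.nodes
        fits = [n for n in pool if n.free_mb() >= mem_mb]
        if not fits:
            return None
        target = max(fits, key=Node.free_mb)
        target.jobs[name] = mem_mb
        return target

    def finish(self, name):
        """Remove a completed job from whichever node runs it."""
        for n in self.nodes:
            n.jobs.pop(name, None)

    def adapt(self, normal_jobs_blocked):
        """Reserve one node while normal jobs are blocked; release it afterwards."""
        if normal_jobs_blocked and not self.reconfigured():
            # Assumption: reserve the node already holding the most large-job memory.
            def large_mb(n):
                return sum(m for m in n.jobs.values() if m >= LARGE_MB)
            max(self.nodes, key=large_mb).reserved = True
        elif not normal_jobs_blocked and self.reconfigured():
            for n in self.nodes:
                n.reserved = False
```

In this toy model, two 1000 MB nodes that each run an 800 MB job cannot admit a 300 MB job. Calling `adapt(True)` reserves one node, so that when a large job finishes on the other node, the freed memory goes to the waiting normal job rather than to the next large job. Calling `adapt(False)` then returns the cluster to ordinary load sharing.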