Failure-aware resource management for high-availability computing clusters with distributed virtual machines

  • Authors: Song Fu
  • Affiliations: Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro NM, USA
  • Venue: Journal of Parallel and Distributed Computing
  • Year: 2010

Abstract

In large-scale networked computing systems, component failures are the norm rather than the exception. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to use system resources efficiently for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems and propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability states, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show that the proposed strategies improve system productivity at a practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is 17.6% higher than that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
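
The abstract does not give the form of the capacity-reliability metric or the exact rules behind the optimistic and pessimistic strategies, so the following is only an illustrative sketch of failure-aware best-fit node selection under assumed definitions. The `Node` fields, the `score` formula, the weight `alpha`, and the safety factor `margin` are all hypothetical and are not taken from the paper. Here "optimistic" is assumed to mean trusting the predicted time to failure as is (`margin=1.0`), and "pessimistic" is assumed to mean requiring extra headroom (`margin > 1`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    capacity: float         # available compute capacity (arbitrary units, assumed)
    time_to_failure: float  # predicted hours until the next failure (assumed predictor output)


def score(node: Node, demand: float, duration: float, alpha: float = 0.5) -> float:
    """Hypothetical capacity-reliability score; lower means a tighter fit."""
    cap_slack = (node.capacity - demand) / demand
    rel_slack = (node.time_to_failure - duration) / duration
    return alpha * cap_slack + (1 - alpha) * rel_slack


def best_fit(nodes: List[Node], demand: float, duration: float,
             margin: float = 1.0, alpha: float = 0.5) -> Optional[Node]:
    """Return the qualified node with the smallest score, or None if none qualifies.

    A node qualifies if it has enough capacity and its predicted time to
    failure is at least margin * duration (assumed qualification rule).
    """
    qualified = [n for n in nodes
                 if n.capacity >= demand and n.time_to_failure >= margin * duration]
    if not qualified:
        return None
    return min(qualified, key=lambda n: score(n, demand, duration, alpha))


nodes = [
    Node("a", capacity=10, time_to_failure=5),
    Node("b", capacity=4, time_to_failure=12),
    Node("c", capacity=5, time_to_failure=30),
    Node("d", capacity=8, time_to_failure=100),
]

optimistic = best_fit(nodes, demand=4, duration=10)               # margin = 1.0
pessimistic = best_fit(nodes, demand=4, duration=10, margin=1.5)  # extra headroom
print(optimistic.name, pessimistic.name)
```

In this toy setup the optimistic choice packs the job onto the tightest-fitting node that is predicted to survive the run, leaving the more reliable nodes free for longer jobs, while the pessimistic choice gives up some of that tightness to guard against prediction error.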