Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

Authors:
Song Fu
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009


Abstract

In large-scale clusters and computational grids, component failures are the norm rather than the exception. Failure occurrences and their impact on system performance and operating costs have become an increasingly important concern for system designers and administrators. In this paper, we study how to use system resources efficiently in high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate the reliability status of nodes, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms to find the best-qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments using failure traces from production clusters and the NAS Parallel Benchmark programs on a real cluster. The results show that the proposed strategies improve system productivity and dependability. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
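
The abstract does not give the capacity-reliability metric or the Best-fit algorithms in detail, so the following Python sketch is only one possible reading of failure-aware node selection, not the paper's method. It assumes that each node has an available capacity and a predicted time to its next failure, that a node qualifies only if it meets the job's capacity demand and is predicted to outlast the job, and that "best fit" means the smallest capacity surplus. The names `Node`, `capacity`, `time_to_failure`, and `best_fit_select`, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float         # available compute capacity (hypothetical unit)
    time_to_failure: float  # predicted hours until next failure (hypothetical)


def best_fit_select(nodes, demand, duration, k):
    """Pick k nodes for a job, or return None if too few qualify.

    Assumed rule (not taken from the paper): a node qualifies if it has
    at least `demand` capacity and is predicted to stay up for at least
    `duration` hours. Qualified nodes are ranked by the smallest capacity
    surplus, with ties broken in favor of the longer predicted uptime.
    """
    feasible = [
        n for n in nodes
        if n.capacity >= demand and n.time_to_failure >= duration
    ]
    if len(feasible) < k:
        return None
    ranked = sorted(
        feasible,
        key=lambda n: (n.capacity - demand, -n.time_to_failure),
    )
    return ranked[:k]


# Made-up example data for illustration only.
nodes = [
    Node("n1", 8.0, 50.0),
    Node("n2", 4.0, 10.0),
    Node("n3", 5.0, 30.0),
    Node("n4", 4.5, 2.0),   # enough capacity, but predicted to fail too soon
    Node("n5", 6.0, 40.0),
]

chosen = best_fit_select(nodes, demand=4.0, duration=8.0, k=2)
print([n.name for n in chosen])  # ['n2', 'n3']
```

In this sketch reliability acts as a hard filter and capacity fit as the ranking key. A single combined score, as the abstract's capacity-reliability metric suggests, would need a weighting between the two factors that the abstract does not specify.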