Failure-aware resource management for high-availability computing clusters with distributed virtual machines

Authors: Song Fu
Affiliations: Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro NM, USA
Venue: Journal of Parallel and Distributed Computing
Year: 2010

Abstract

In large-scale networked computing systems, component failures are the norm rather than the exception. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to use system resources efficiently for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems, and we propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate the reliability states of nodes, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show that the proposed strategies improve system productivity with practically achievable failure-prediction accuracy. With the Best-fit strategies, the job completion rate is 17.6% higher than that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
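To make the node selection idea concrete, the following is a minimal Python sketch of a best-fit selection that weighs capacity against predicted reliability. It is an illustration only, not the paper's algorithm: the abstract does not give the capacity-reliability metric, so the score below (the sum of relative capacity slack and relative reliability slack) is an assumption, as are the names `Node`, `capacity_reliability`, `best_fit`, and the `margin` parameter. The optimistic and pessimistic variants are modelled here, hypothetically, as trusting the predicted time to failure as is versus discounting it by a safety factor.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float       # available compute capacity (arbitrary units)
    predicted_ttf: float  # predicted time to next failure, in hours


def capacity_reliability(node, demand, duration, margin=1.0):
    """Hypothetical score: smaller means a tighter fit.

    Returns None when the node cannot host the job, either because it
    lacks capacity or because it is predicted to fail before the job ends.
    """
    usable_ttf = node.predicted_ttf / margin  # margin > 1 discounts the prediction
    if node.capacity < demand or usable_ttf < duration:
        return None
    capacity_slack = (node.capacity - demand) / demand
    reliability_slack = (usable_ttf - duration) / duration
    return capacity_slack + reliability_slack


def best_fit(nodes, demand, duration, count, pessimistic=False):
    """Pick the `count` qualified nodes with the least combined slack.

    Assumption: the pessimistic strategy halves the predicted time to
    failure; the optimistic strategy uses the prediction unchanged.
    Returns an empty list when too few nodes qualify.
    """
    margin = 2.0 if pessimistic else 1.0
    scored = []
    for node in nodes:
        score = capacity_reliability(node, demand, duration, margin)
        if score is not None:
            scored.append((score, node.name))
    scored.sort()
    if len(scored) < count:
        return []
    return [name for _, name in scored[:count]]


nodes = [
    Node("a", capacity=4, predicted_ttf=10),
    Node("b", capacity=8, predicted_ttf=30),
    Node("c", capacity=5, predicted_ttf=50),
    Node("d", capacity=2, predicted_ttf=100),
]

# A job needing 4 capacity units per VM for 8 hours on 2 nodes.
print(best_fit(nodes, demand=4, duration=8, count=2))
print(best_fit(nodes, demand=4, duration=8, count=2, pessimistic=True))
```

In this toy setup the optimistic call keeps node `a`, whose predicted time to failure barely exceeds the job duration, while the pessimistic call rejects it and falls back to nodes with more reliability headroom. That trade-off between using relatively unreliable nodes and avoiding job loss is the one the abstract describes.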