In large-scale networked computing systems, component failures are the norm rather than the exception. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to use system resources efficiently for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems and propose failure-aware node selection strategies for constructing and reconfiguring RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability states, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show that the proposed strategies enhance system productivity with practically achievable failure-prediction accuracy. With the Best-fit strategies, the job completion rate increases by 17.6% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
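The abstract names a capacity-reliability metric and Best-fit selection but does not define them, so the following is only an illustrative sketch of how such a selection might look, not the paper's formulation. Everything in it is an assumption: the `Node` fields, the weighted score, the `alpha` weight, and the reading of "optimistic" as trusting the predicted time to failure as-is and "pessimistic" as discounting it by a safety factor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_capacity: float   # available compute capacity (arbitrary units)
    predicted_ttf: float   # predicted time to next failure, in hours


# Hypothetical safety factors applied to the predicted time to failure.
DISCOUNT = {"optimistic": 1.0, "pessimistic": 0.8}


def score(node: Node, demand: float, duration: float,
          alpha: float = 0.5, discount: float = 1.0) -> Optional[float]:
    """Assumed capacity-reliability score in (0, 1]; None if the node is unqualified."""
    if demand <= 0 or duration <= 0:
        raise ValueError("demand and duration must be positive")
    ttf = node.predicted_ttf * discount
    # A node qualifies only if it has enough capacity and is expected
    # to stay up for the whole job.
    if node.free_capacity < demand or ttf < duration:
        return None
    fit = demand / node.free_capacity   # 1.0 means no capacity is left idle
    slack = duration / ttf              # 1.0 means no reliability is left unused
    return alpha * fit + (1 - alpha) * slack


def select_best_fit(nodes: List[Node], demand: float, duration: float,
                    strategy: str = "optimistic") -> Optional[Node]:
    """Return the qualified node with the tightest combined fit, or None."""
    discount = DISCOUNT[strategy]
    best, best_score = None, -1.0
    for node in nodes:
        s = score(node, demand, duration, discount=discount)
        if s is not None and s > best_score:
            best, best_score = node, s
    return best


if __name__ == "__main__":
    cluster = [Node("a", 8, 100), Node("b", 4, 12), Node("c", 4, 10.5)]
    for strategy in ("optimistic", "pessimistic"):
        chosen = select_best_fit(cluster, demand=4, duration=10, strategy=strategy)
        print(strategy, chosen.name if chosen else None)
```

In this sketch a tight fit is preferred on both axes, so highly reliable, high-capacity nodes stay free for longer or larger jobs. Under the pessimistic setting, nodes whose discounted time to failure falls short of the job duration are excluded outright.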