Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Authors:
James Brandt;Frank Chen;Vincent De Sapio;Ann Gentile;Jackson Mayo;Philippe Pébay;Diana Roe;David Thompson;Matthew Wong
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Citing 13
Cited 0

A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform
CLADE '03 Proceedings of the 1st International Workshop on Challenges of Large Applications in Distributed Environments

Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing

Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies

Proactive fault tolerance for HPC with Xen virtualization
Proceedings of the 21st annual international conference on Supercomputing

Modeling the Impact of Checkpoints on Next-Generation Systems
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies

Bad Words: Finding Faults in Spirit's Syslogs
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid

Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines
Proceedings of the 2nd workshop on System-level virtualization for high performance computing

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Proceedings of the 2009 workshop on Resiliency in high performance

DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems

A study of a KVM-based cluster for grid computing
Proceedings of the 47th Annual Southeast Regional Conference

Abstract

Accurate failure prediction, combined with efficient process migration facilities that include some Cloud constructs, can enable failure avoidance in large-scale high-performance computing (HPC) platforms. In this work we demonstrate a prototype that integrates our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole-system approach to failure avoidance. The demonstration uses a failure scenario based on a real-world HPC case study.