CLADE '03 Proceedings of the 1st International Workshop on Challenges of Large Applications in Distributed Environments
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Proactive fault tolerance for HPC with Xen virtualization
Proceedings of the 21st annual international conference on Supercomputing
Modeling the Impact of Checkpoints on Next-Generation Systems
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
Bad Words: Finding Faults in Spirit's Syslogs
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Proceedings of the 2nd workshop on System-level virtualization for high performance computing
Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Proceedings of the 2009 workshop on Resiliency in high performance
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
A study of a KVM-based cluster for grid computing
Proceedings of the 47th Annual Southeast Regional Conference
Hi-index | 0.02 |
Accurate failure prediction in conjunction with efficient process migration facilities including some Cloud constructs can enable failure avoidance in large-scale high performance computing (HPC) platforms. In this work we demonstrate a prototype system that incorporates our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole system approach to failure avoidance. This work utilizes a failure scenario based on a real-world HPC case study.