A tunable holistic resiliency approach for high-performance computing systems

Authors:
Stephen L. Scott;Christian Engelmann;Geoffroy R. Vallée;Thomas Naughton;Anand Tikotekar;George Ostrouchov;Chokchai Leangsuksun;Nichamon Naksinehaboon;Raja Nassar;Mihaela Paun;Frank Mueller;Chao Wang;Arun B. Nagarajan;Jyothish Varma
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA;Louisiana Tech University, Ruston, LA, USA;Louisiana Tech University, Ruston, LA, USA;Louisiana Tech University, Ruston, LA, USA;Louisiana Tech University, Ruston, LA, USA;North Carolina State University, Raleigh, NC, USA;North Carolina State University, Raleigh, NC, USA;North Carolina State University, Raleigh, NC, USA;North Carolina State University, Raleigh, NC, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Abstract

Because failure rates are expected to be high, resiliency has become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing work on novel fault resilience technologies for HPC: proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. It places these individual technologies in the context of a proposed holistic fault resilience framework.
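
As a rough illustration of the kind of trade-off the abstract mentions between preemptive migration and checkpoint/restart, the sketch below uses Young's well-known first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF). This is not the model developed by the authors. It assumes, purely for illustration, that failure prediction lets migration avoid a fixed fraction of failures, so the application sees a proportionally longer mean time between failures. The function names, the checkpoint cost, and the MTBF value are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, avoided_fraction: float) -> float:
    """MTBF seen by the application when a fraction of failures is avoided.

    Assumption (not from the poster): each failure is independently avoided
    by preemptive migration with probability `avoided_fraction`.
    """
    if not 0.0 <= avoided_fraction < 1.0:
        raise ValueError("avoided_fraction must be in [0, 1)")
    return mtbf_s / (1.0 - avoided_fraction)


if __name__ == "__main__":
    checkpoint_cost_s = 300.0      # hypothetical: 5 minutes to write a checkpoint
    system_mtbf_s = 24 * 3600.0    # hypothetical: one failure per day

    for fraction in (0.0, 0.5, 0.75):
        mtbf = effective_mtbf(system_mtbf_s, fraction)
        interval = young_interval(checkpoint_cost_s, mtbf)
        print(f"avoided={fraction:.2f}  effective MTBF={mtbf:.0f} s  "
              f"checkpoint every {interval:.0f} s")
```

Under these assumptions, avoiding more failures through migration lengthens the suggested checkpoint interval, which is the intuition behind combining the two mechanisms. The sketch ignores migration overhead and false-positive predictions, both of which a fuller trade-off model would need to account for.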