Toward Exascale Resilience

Authors:
Franck Cappello;Al Geist;Bill Gropp;Laxmikant Kale;Bill Kramer;Marc Snir
Affiliations:
INRIA, LABORATOIRE EN RECHERCHE INFORMATIQUE, FRANCE;OAK RIDGE NATIONAL LABORATORY, TN, USA;DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA;DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA;NERSC, LAWRENCE BERKELEY NATIONAL LABORATORY, IL, USA;DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2009


Abstract

Over the past few years, resilience has become a major issue for high-performance computing (HPC) systems, particularly in view of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. Based on current knowledge and observations of existing large systems, exascale systems are expected to experience various kinds of faults many times per day. The current approach to resilience, which relies on automatic or application-level checkpoint/restart, is also expected to stop working because the time for checkpointing and restarting will exceed the mean time to failure of a full system. These projections leave the HPC fault-tolerance community with a difficult challenge: finding new approaches, possibly radically disruptive ones, that run applications until their normal termination despite the essentially unstable nature of exascale systems. Yet the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations and research issues considered determinant by several complementary experts in HPC applications, programming models, distributed systems and system management.