Designing programs that check their work
Journal of the ACM (JACM)
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
IEEE Transactions on Parallel and Distributed Systems
Memory exclusion: optimizing the performance of checkpointing systems
Software—Practice & Experience
Self-stabilizing systems in spite of distributed control
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Basic Concepts and Taxonomy of Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
BlueGene/L Failure Analysis and Prediction Models
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Scalable diskless checkpointing for large parallel systems
Scalable diskless checkpointing for large parallel systems
Using fault injection and modeling to evaluate the performability of cluster-based services
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
What Supercomputers Say: A Study of Five System Logs
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A tunable holistic resiliency approach for high-performance computing systems
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Percu: a holistic method for evaluating high performance computing systems
Percu: a holistic method for evaluating high performance computing systems
Proactive fault tolerance in MPI applications via task migration
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Super-Scalable algorithms for computing on 100,000 processors
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
Stochastic computing: embracing errors in architectureand design of processors and applications
CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Run-through stabilization: an MPI proposal for process fault tolerance
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Analyzing fault aware collective performance in a process fault tolerant MPI
Parallel Computing
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Can checkpoint/restart mechanisms benefit from hierarchical data staging?
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Fault tolerant parallel data-intensive algorithms
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Fault tolerant preconditioned conjugate gradient for sparse linear system solution
Proceedings of the 26th ACM international conference on Supercomputing
Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
High performance checksum computation for fault-tolerant MPI over infiniband
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
An evaluation of user-level failure mitigation support in MPI
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming model extensions for resilience in extreme scale computing
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
User level failure mitigation in MPI
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Reliable scalable symbolic computation: the design of SymGridPar2
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Performance comparison under failures of MPI and MapReduce: An analytical approach
Future Generation Computer Systems
Exploring DRAM organizations for energy-efficient and resilient exascale memories
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The Journal of Supercomputing
Verifying quantitative reliability for programs that execute on unreliable hardware
Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Bespoke physics for living technology
Artificial Life
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
A study of application-level recovery methods for transient network faults
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
A block-asynchronous relaxation method for graphics processing units
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Over the past few years resilience has became a major issue for high-performance computing (HPC) systems, in particular in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several millions of central processing unit (CPU) cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint/ restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, which are possibly radically disruptive, to run applications until their normal termination, despite the essentially unstable nature of exascale systems. Yet, the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations and research issues considered as determinant of several complimentary experts of HPC in applications, programming models, distributed systems and system management.