Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
IEEE Transactions on Parallel and Distributed Systems
Memory exclusion: optimizing the performance of checkpointing systems
Software—Practice & Experience
Self-stabilizing systems in spite of distributed control
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
BlueGene/L Failure Analysis and Prediction Models
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Scalable diskless checkpointing for large parallel systems
Scalable diskless checkpointing for large parallel systems
What Supercomputers Say: A Study of Five System Logs
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A tunable holistic resiliency approach for high-performance computing systems
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Proactive fault tolerance in MPI applications via task migration
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Virtualizing high performance computing
ACM SIGOPS Operating Systems Review
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing
Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Scientific data services: a high-performance I/O system with array semantics
Proceedings of the first annual workshop on High performance computing meets databases
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Tuple switching network-When slower may be better
Journal of Parallel and Distributed Computing
Developing a power measurement framework for cyber defense
Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop
Future Generation Computer Systems
ACR: automatic checkpoint/restart for soft and hard error protection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The Journal of Supercomputing
Post-failure recovery of MPI communication capability: Design and rationale
International Journal of High Performance Computing Applications
Evaluating energy savings for checkpoint/restart
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Hi-index | 0.00 |
The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollbackâ聙聰recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.