Recovery in distributed systems using asynchronous message logging and checkpointing
PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys (CSUR)
A generalized birthday problem
SIAM Review
Fast parallel algorithms for short-range molecular dynamics
Journal of Computational Physics
Hive: fault containment for shared-memory multiprocessors
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Random Graphs 93 Proceedings of the sixth international seminar on Random graphs and probabilistic methods in combinatorics and computer science
Fundamentals of fault-tolerant distributed computing in asynchronous environments
ACM Computing Surveys (CSUR)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Practical byzantine fault tolerance and proactive recovery
ACM Transactions on Computer Systems (TOCS)
Predictive performance and scalability modeling of a large-scale application
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
The architecture of Tandem's NonStop system
ACM '81 Proceedings of the ACM '81 conference
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
An Experimental Study about Diskless Checkpointing
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Modeling the Impact of Checkpoints on Next-Generation Systems
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
2-step algorithm for enhancing effectiveness of sender-based message logging
SpringSim '07 Proceedings of the 2007 spring simulation multiconference - Volume 2
An analysis of clustered failures on large supercomputing systems
Journal of Parallel and Distributed Computing
International Journal of High Performance Computing Applications
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Transparent redundant computing with MPI
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Proceedings of the 6th international workshop on Virtualization Technologies in Distributed Computing Date
Data-driven fault tolerance for work stealing computations
Proceedings of the 26th ACM international conference on Supercomputing
Optimizing latency and throughput for spawning processes on massively multicore processors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Alleviating scalability issues of checkpointing protocols
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Detection and correction of silent data corruption for large-scale high-performance computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The viability of using compression to decrease message log sizes
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Replication for send-deterministic MPI HPC applications
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Using unreliable virtual hardware to inject errors in extreme-scale systems
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
When is multi-version checkpointing needed?
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Evaluating the feasibility of using memory content similarity to improve system resilience
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds
Journal of Parallel and Distributed Computing
Optimizing process creation and execution on multi-core architectures
International Journal of High Performance Computing Applications
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A 'cool' way of improving the reliability of HPC machines
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ACR: automatic checkpoint/restart for soft and hard error protection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Post-failure recovery of MPI communication capability: Design and rationale
International Journal of High Performance Computing Applications
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Multi-criteria checkpointing strategies: response-time versus resource utilization
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Evaluating energy savings for checkpoint/restart
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Checkpointing algorithms and fault prediction
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.