A first order approximation to the optimum checkpoint interval
Communications of the ACM
BProc: the Beowulf distributed process space
ICS '02 Proceedings of the 16th international conference on Supercomputing
Characterization of Bandwidth-Aware Meta-Schedulers for Co-Allocating Jobs Across Multiple Clusters
The Journal of Supercomputing
Application Resilience: Making Progress in Spite of Failure
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Concurrency and Computation: Practice & Experience - Special Issue: Advanced Strategies in Grid Environments
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Checkpointing strategies for parallel jobs
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Application monitoring and checkpointing in HPC: looking towards exascale systems
Proceedings of the 50th Annual Southeast Regional Conference
iSPD: an iconic-based modeling simulator for distributed grids
Proceedings of the 45th Annual Simulation Symposium
Comparing checkpoint and rollback recovery schemes in a cluster system
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
When is multi-version checkpointing needed?
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Hi-index | 0.00 |
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is the use of checkpointing. By making use of a multi-cluster simulator, we study the impact of sub-optimal checkpoint intervals on overall application efficiency. By using a model of a 1926 node cluster and workload statistics from Los Alamos National Laboratory to parameterize the simulator, we find that dramatically overestimating the AMTTI has a fairly minor impact on application efficiency while potentially having a much more severe impact on user-centric performance metrics such a queueing delay. We compare and contrast these results with the trends predicted by an analytical model.