Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters

Authors:
William M. Jones;John T. Daly;Nathan DeBardeleben
Affiliations:
Coastal Carolina University, Conway, SC;ACS, Fort Meade, MD;Los Alamos National Laboratory, Los Alamos, NM
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Abstract

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing. Using a multi-cluster simulator, we study the impact of sub-optimal checkpoint intervals on overall application efficiency. By using a model of a 1926-node cluster and workload statistics from Los Alamos National Laboratory to parameterize the simulator, we find that dramatically overestimating the application mean time to interrupt (AMTTI) has a fairly minor impact on application efficiency while potentially having a much more severe impact on user-centric performance metrics such as queueing delay. We compare and contrast these results with the trends predicted by an analytical model.
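
The relationship between a mis-estimated AMTTI and application efficiency can be illustrated with a small sketch. This is not the simulator or the analytical model used in the paper. It assumes exponentially distributed interrupts, picks the checkpoint interval with Young's first-order approximation (tau = sqrt(2 * delta * M)), and evaluates efficiency with an expected-wall-time expression of the kind used in Daly's higher-order model. The function names and all parameter values (a 24-hour AMTTI, a 5-minute checkpoint, a 10-minute restart) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost, mtti):
    """Young's first-order estimate of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtti)


def efficiency(interval, checkpoint_cost, restart_cost, mtti):
    """Solve time divided by expected wall time, assuming exponential interrupts.

    Expected wall time per unit of solve time is modeled as
    M * exp(R / M) * (exp((tau + delta) / M) - 1) / tau.
    """
    wall_per_solve = (
        mtti
        * math.exp(restart_cost / mtti)
        * math.expm1((interval + checkpoint_cost) / mtti)
        / interval
    )
    return 1.0 / wall_per_solve


# Hypothetical parameters, all in minutes (assumptions, not values from the paper).
TRUE_MTTI = 24 * 60.0
CHECKPOINT_COST = 5.0
RESTART_COST = 10.0

# The user guesses the AMTTI as factor * TRUE_MTTI and sets the interval from that
# guess; the resulting efficiency is evaluated against the true AMTTI.
for factor in (0.25, 1.0, 4.0, 16.0):
    tau = young_interval(CHECKPOINT_COST, factor * TRUE_MTTI)
    eff = efficiency(tau, CHECKPOINT_COST, RESTART_COST, TRUE_MTTI)
    print(f"AMTTI guess x{factor:>5}: interval = {tau:7.1f} min, efficiency = {eff:.3f}")
```

Under these assumptions the sketch captures only the application-efficiency side of the trade-off. Queueing delay and other user-centric metrics depend on workload and scheduling effects that a closed-form expression like this does not model.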