On the Optimal Total Processing Time Using Checkpoints
IEEE Transactions on Software Engineering
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme
IEEE Transactions on Computers
A first order approximation to the optimum checkpoint interval
Communications of the ACM
A Variational Calculus Approach to Optimal Checkpoint Placement
IEEE Transactions on Computers
Autonomic Web-Based Simulation
ANSS '05 Proceedings of the 38th annual Symposium on Simulation
Reliability challenges in large systems
Future Generation Computer Systems
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Fault oblivious high performance computing with dynamic task replication and substitution
Computer Science - Research and Development
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Checkpoint scheduling model for optimality
Information Processing Letters
Exploring reliability of exascale systems through simulations
Proceedings of the High Performance Computing Symposium
Automatic identification of application I/O signatures from noisy server-side traces
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
As the run time of an application approaches the mean time to interrupt (MTTI) of the system on which it is running, it becomes necessary to generate intermediate snapshots of the application's run state, known as checkpoint files or restart dumps. If a system failure halts program execution, these snapshots allow the application to resume computing from the most recently saved intermediate state instead of starting over at the beginning of the calculation. In this paper, three models for predicting the optimum compute interval between restart dumps are discussed. The models are evaluated by comparing their results to a simulation that emulates an application running on an actual system with interrupts. The results are then used to derive a simple method for calculating the optimum restart interval.
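As a rough illustration of the kind of comparison described above, the following minimal sketch pairs one well-known estimate, Young's first-order approximation sqrt(2 * dump_time * MTTI), with a small Monte Carlo simulation of a run subject to interrupts. The abstract does not state which three models the paper evaluates, so the choice of Young's formula, the exponential failure model, and all names and parameters (`young_interval`, `simulate_wall_time`, `dump_time`, `restart_time`) are assumptions made for this example, not the paper's method.

```python
import math
import random


def young_interval(dump_time, mtti):
    """Young's first-order estimate of the compute interval between dumps."""
    return math.sqrt(2.0 * dump_time * mtti)


def simulate_wall_time(solve_time, interval, dump_time, restart_time, mtti, rng):
    """Simulate one run and return its total wall-clock time.

    Assumptions (illustrative only): interrupts arrive with exponentially
    distributed gaps of mean `mtti`, no interrupt occurs during a restart,
    and every compute segment, including the last, ends with a dump.
    """
    saved_work = 0.0  # work protected by the most recent dump
    wall = 0.0        # elapsed wall-clock time
    time_to_failure = rng.expovariate(1.0 / mtti)
    while saved_work < solve_time:
        segment_work = min(interval, solve_time - saved_work)
        segment = segment_work + dump_time
        if time_to_failure >= segment:
            # Segment and its dump complete before the next interrupt.
            wall += segment
            time_to_failure -= segment
            saved_work += segment_work
        else:
            # Interrupt: the partial segment is lost, then the run restarts.
            wall += time_to_failure + restart_time
            time_to_failure = rng.expovariate(1.0 / mtti)
    return wall


def mean_wall_time(solve_time, interval, dump_time, restart_time, mtti,
                   trials=2000, seed=0):
    """Average wall-clock time over several simulated runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_wall_time(solve_time, interval, dump_time, restart_time, mtti, rng)
        for _ in range(trials)
    )
    return total / trials
```

One way to use the sketch is to call `mean_wall_time` for several intervals on either side of `young_interval(dump_time, mtti)` and see which gives the smallest average wall-clock time; how closely the estimate tracks the simulated optimum depends on the assumed parameters.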