Online checkpointing with improved worst-case guarantees

Authors:
Karl Bringmann;Benjamin Doerr;Adrian Neumann;Jakub Sliacan
Affiliations:
Max Planck Institute for Informatics, Saarbrücken, Germany;Max Planck Institute for Informatics, Saarbrücken, Germany;Max Planck Institute for Informatics, Saarbrücken, Germany;Max Planck Institute for Informatics, Saarbrücken, Germany
Venue:
ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part I
Year:
2013

Abstract

In the online checkpointing problem, the task is to continuously maintain a set of k checkpoints that allow an ongoing computation to be rewound faster than by a full restart. The only operation allowed is to replace an old checkpoint with the current state. We aim for checkpoint placement strategies that minimize the rewinding cost, i.e., such that at all times T, when requested to rewind to some time t≤T, the number of computation steps that need to be redone to get to t from a checkpoint before t is as small as possible. In particular, we want the closest checkpoint earlier than t to be no further away from t than q_k times the ideal distance T/(k+1), where q_k is a small constant. Improving over earlier work showing 1+1/k≤q_k≤2, we show that q_k can be chosen asymptotically less than 2. We present algorithms with asymptotic discrepancy q_k≤1.59+o(1) valid for all k and q_k≤ln(4)+o(1)≤1.39+o(1) valid for k being a power of two. Experiments indicate the uniform bound q_k≤1.7 for all k. For small k, we show how to use a linear programming approach to compute good checkpointing algorithms. This gives discrepancies of less than 1.55 for all such k. We prove the first lower bound that is asymptotically more than one, namely q_k≥1.30−o(1). We also show that optimal algorithms (yielding the infimum discrepancy) exist for all k.
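
To make the quality measure concrete, the following is a minimal Python sketch of how the discrepancy of one checkpoint set at one time T could be evaluated. It is not code from the paper. The function names `discrepancy` and `replace_checkpoint` are hypothetical, and the sketch assumes that the start of the computation (time 0) can always be restored, so that k checkpoints together with 0 and T delimit k+1 intervals.

```python
from typing import List, Sequence


def discrepancy(checkpoints: Sequence[float], T: float) -> float:
    """Longest interval divided by the ideal interval length T / (k + 1).

    Assumption (not stated in the abstract): time 0 acts as an implicit,
    always-available checkpoint, so the intervals are delimited by
    0, the k checkpoints, and the current time T.
    """
    k = len(checkpoints)
    if k == 0 or T <= 0:
        raise ValueError("need at least one checkpoint and a positive time T")
    if any(c < 0 or c > T for c in checkpoints):
        raise ValueError("checkpoints must lie in the interval [0, T]")
    points = [0.0, *sorted(checkpoints), T]
    longest_gap = max(b - a for a, b in zip(points, points[1:]))
    return longest_gap / (T / (k + 1))


def replace_checkpoint(checkpoints: Sequence[float], old: float, T: float) -> List[float]:
    """The only allowed operation: drop one old checkpoint, store the state at time T."""
    remaining = [c for c in checkpoints if c != old]
    if len(remaining) != len(checkpoints) - 1:
        raise ValueError("'old' must match exactly one existing checkpoint")
    return sorted(remaining + [T])


# Evenly spaced checkpoints reach the ideal distance exactly.
even = discrepancy([3, 6, 9], T=12)

# Uneven spacing: the longest gap is 4, the ideal distance is 12 / 4 = 3.
uneven = discrepancy([2, 4, 8], T=12)

# A poor replacement at T = 12 leaves a gap of 6 at the start of the computation.
after_bad_move = discrepancy(replace_checkpoint([3, 6, 9], old=3, T=12), T=12)
```

Under these assumptions, a checkpointing strategy is judged by the largest value that `discrepancy` takes over all times T, which is the quantity the bounds in the abstract refer to.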