A higher order estimate of the optimum checkpoint interval for restart dumps

Authors: J. T. Daly
Affiliations: Los Alamos National Laboratory, M/S T080, Los Alamos, NM 87545, USA
Venue: Future Generation Computer Systems
Year: 2006

Abstract

This paper examines methods for approximating the optimum checkpoint-restart strategy that minimizes application run time on a system exhibiting Poisson single-component failures. Two models are developed and compared. The first uses a simplified cost function that yields a first-order model. The second derives a more complete cost function and uses a perturbation solution to obtain accurate higher-order approximations to the optimum checkpoint interval.
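
The abstract itself gives no formulas, so the following is only a sketch, based on the forms in which these results are usually quoted, of how a first-order estimate and a higher-order estimate of the checkpoint interval can be compared. All names are illustrative assumptions: `delta` is the time to write one checkpoint, `restart` is the restart time, `mtbf` is the mean time between failures, `solve_time` is the failure-free compute time, and `tau` is the compute time between checkpoints. The first-order function uses the classic square-root estimate attributed to Young, the higher-order function uses the perturbation result as it is commonly cited from this paper, and the numerical search minimizes the expected wall-clock cost function in its commonly cited form.

```python
import math


def first_order_interval(delta, mtbf):
    """Classic first-order estimate: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtbf)


def higher_order_interval(delta, mtbf):
    """Higher-order estimate as commonly quoted (assumed form, not taken from the abstract)."""
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    series = 1.0 + math.sqrt(x) / 3.0 + x / 9.0
    return math.sqrt(2.0 * delta * mtbf) * series - delta


def expected_wall_time(tau, solve_time, delta, restart, mtbf):
    """Expected wall-clock time under Poisson failures (assumed cost function)."""
    growth = math.expm1((tau + delta) / mtbf)  # exp(x) - 1, accurate for small x
    return mtbf * math.exp(restart / mtbf) * growth * solve_time / tau


def numeric_optimum(delta, mtbf, iterations=200):
    """Golden-section search for the tau that minimizes the cost function.

    Only the factor expm1((tau + delta) / M) / tau depends on tau, and it has
    a single minimum on (0, M), so that interval is used as the bracket.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-9 * mtbf, mtbf

    def cost(tau):
        return math.expm1((tau + delta) / mtbf) / tau

    for _ in range(iterations):
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if cost(a) < cost(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical inputs in arbitrary but consistent time units.
    delta, restart, mtbf, solve_time = 5.0, 10.0, 1000.0, 10000.0
    for name, tau in [
        ("first order", first_order_interval(delta, mtbf)),
        ("higher order", higher_order_interval(delta, mtbf)),
        ("numeric", numeric_optimum(delta, mtbf)),
    ]:
        wall = expected_wall_time(tau, solve_time, delta, restart, mtbf)
        print(f"{name:>12}: tau = {tau:.3f}, expected wall time = {wall:.1f}")
```

With these hypothetical inputs the first-order estimate overshoots the numerically located minimum by a few percent, while the higher-order estimate lands very close to it. The gap widens as the checkpoint cost grows relative to the mean time between failures.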