This paper examines methods of approximating the optimum checkpoint-restart strategy for minimizing application run time on a system exhibiting Poisson single-component failures. Two models are developed and compared. We begin with a simplified cost function that yields a first-order model. We then derive a more complete cost function and demonstrate a perturbation solution that provides accurate high-order approximations to the optimum checkpoint interval.
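To make the contrast between the two models concrete, here is a minimal sketch in Python. None of the formulas below appear in the abstract, so treat all of them as assumptions: `first_order_interval` uses the classic square-root rule, `higher_order_interval` uses the closed form commonly quoted for this kind of perturbation solution, and `expected_wall_time` is an assumed exponential cost model for Poisson failures. The symbol names (`delta` for checkpoint write time, `mtbf` for mean time between failures, `restart` for restart time, `solve_time` for failure-free run time) are illustrative and not taken from the source.

```python
import math


def first_order_interval(delta, mtbf):
    """First-order estimate (assumed form): tau = sqrt(2 * delta * mtbf)."""
    return math.sqrt(2.0 * delta * mtbf)


def higher_order_interval(delta, mtbf):
    """Higher-order estimate (assumed closed form of a perturbation solution)."""
    if delta >= 2.0 * mtbf:
        # Assumed fallback when checkpoints are very expensive relative to MTBF.
        return mtbf
    x = delta / (2.0 * mtbf)
    series = 1.0 + math.sqrt(x) / 3.0 + x / 9.0
    return math.sqrt(2.0 * delta * mtbf) * series - delta


def expected_wall_time(tau, delta, mtbf, restart, solve_time):
    """Assumed cost model: expected run time with exponential failure gaps."""
    growth = math.exp((tau + delta) / mtbf) - 1.0
    return mtbf * math.exp(restart / mtbf) * growth * solve_time / tau


def numerical_optimum(delta, mtbf, restart=0.0, solve_time=1.0):
    """Golden-section search for the tau that minimizes expected_wall_time."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-6 * mtbf, 5.0 * mtbf

    def cost(tau):
        return expected_wall_time(tau, delta, mtbf, restart, solve_time)

    for _ in range(200):
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if cost(a) < cost(b):
            hi = b  # minimum lies in [lo, b]
        else:
            lo = a  # minimum lies in [a, hi]
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical inputs: 5-minute checkpoints, 24-hour MTBF (in minutes).
    delta, mtbf = 5.0, 1440.0
    print("first-order :", first_order_interval(delta, mtbf))
    print("higher-order:", higher_order_interval(delta, mtbf))
    print("numerical   :", numerical_optimum(delta, mtbf))
```

With the hypothetical inputs above, the higher-order estimate lands much closer to the numerical minimum of the assumed cost model than the first-order estimate does, which is the kind of improvement the abstract describes. The gap between the two estimates grows as the checkpoint time becomes a larger fraction of the MTBF.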