A first order approximation to the optimum checkpoint interval

Authors: John W. Young
Affiliations: Martin Marietta Corp., Orlando, FL
Venue: Communications of the ACM
Year: 1974

Citing 0
Cited 81

Analysis of a Class of Recovery Procedures

IEEE Transactions on Computers
Optimal checkpointing of real-time tasks

IEEE Transactions on Computers
An Experimental Study to Determine Task Size for Rollback Recovery Systems

IEEE Transactions on Computers
Recovery Point Selection on a Reverse Binary Tree Task Model

IEEE Transactions on Software Engineering
Comparative Analysis of Different Models of Checkpointing and Recovery

IEEE Transactions on Software Engineering
Selecting the checkpoint interval in time warp simulation

PADS '93 Proceedings of the seventh workshop on Parallel and distributed simulation
A case for two-level distributed recovery schemes

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

IEEE Transactions on Computers
A Case for Two-Level Recovery Schemes

IEEE Transactions on Computers
A practical guide to the design of differential files for recovery of on-line databases

ACM Transactions on Database Systems (TODS)
Optimal policy for batch operations: backup, checkpointing, reorganization, and updating

ACM Transactions on Database Systems (TODS)
On the Optimum Checkpoint Interval

Journal of the ACM (JACM)
Checkpointing strategies for database systems

CSC '87 Proceedings of the 15th annual conference on Computer Science
Fault Tolerant Operating Systems

ACM Computing Surveys (CSUR)
Performance analysis of checkpointing strategies

ACM Transactions on Computer Systems (TOCS)
Optimization criteria for checkpoint placement

Communications of the ACM
Performance of rollback recovery systems under intermittent failures

Communications of the ACM
Analysis of Checkpointing for Real-Time Systems

Real-Time Systems
A Variational Calculus Approach to Optimal Checkpoint Placement

IEEE Transactions on Computers
Dynamic database dumping

SIGMOD '78 Proceedings of the 1978 ACM SIGMOD international conference on management of data
Stochastic Models for Performance Analysis of Database Recovery Control

IEEE Transactions on Computers
Performance Evaluation of a Two Level Error Recovery Scheme for Distributed Systems

IWDC '02 Proceedings of the 4th International Workshop on Distributed Computing, Mobile and Wireless Computing
A model of roll-back recovery with multiple checkpoints

ICSE '76 Proceedings of the 2nd international conference on Software engineering
An architecture for fault tolerance in database systems

ACM '80 Proceedings of the ACM 1980 annual conference
An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Fault tolerant high performance computing by a coding approach

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A higher order estimate of the optimum checkpoint interval for restart dumps

Future Generation Computer Systems
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle

IEEE Transactions on Dependable and Secure Computing
Cooperative checkpointing: a robust approach to large-scale systems reliability

Proceedings of the 20th annual international conference on Supercomputing
Goal-Directed Reasoning for Specification-Based Data Structure Repair

IEEE Transactions on Software Engineering
Design and Evaluation of a Fault-Tolerant Multiprocessor Using Hardware Recovery Blocks

IEEE Transactions on Computers
Model-based performance evaluation of distributed checkpointing protocols

Performance Evaluation
Performance under failures of high-end computing

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Proactive process-level live migration in HPC environments

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Bristlecone: A Language for Robust Software Systems

ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems

International Journal of Parallel, Emergent and Distributed Systems
Fault-aware scheduling for Bag-of-Tasks applications on Desktop Grids

GRID '06 Proceedings of the 7th IEEE/ACM International Conference on Grid Computing
Numerical computation algorithms for sequential checkpoint placement

Performance Evaluation
Modeling and Analysis of Checkpoint I/O Operations

ASMTA '09 Proceedings of the 16th International Conference on Analytical and Stochastic Modeling Techniques and Applications
Performance under Failures of DAG-based Parallel Computing

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
An analysis of location record checkpointing interval for mobility database in PCS networks

Wireless Networks
A model for predicting the optimum checkpoint interval for restart dumps

ICCS'03 Proceedings of the 2003 international conference on Computational science
Analysis of a software system with rejuvenation, restoration and checkpointing

ISAS'08 Proceedings of the 5th international conference on Service availability
Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

Journal of Systems and Software
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
A flexible checkpoint/restart model in distributed systems

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Dynamic performance prediction of an adaptive mesh application

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Providing resiliency for optical grids by exploiting relocation: A dimensioning study based on ILP

Computer Communications
Energy-aware checkpoint intervals in error-prone mobile networks

Proceedings of the 6th International Conference on Queueing Theory and Network Applications
An initial approximation to the resource-optimal checkpoint interval

PaCT'11 Proceedings of the 11th international conference on Parallel computing technologies
FTI: high performance fault tolerance interface for hybrid systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Checkpointing strategies for parallel jobs

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Modeling and tolerating heterogeneous failures in large parallel systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
SpotMPI: a framework for auction-based HPC computing using amazon spot instances

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Proactive process-level live migration and back migration in HPC environments

Journal of Parallel and Distributed Computing
ACM SRC poster: SpotMPI: auction-based high performance cloud computing

Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
Fault-Tolerant scheduling for bag-of-tasks grid applications

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Application monitoring and checkpointing in HPC: looking towards exascale systems

Proceedings of the 50th Annual Southeast Regional Conference
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Proceedings of the VLDB Endowment
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimal backup policy for a database system with incremental and full backups

Mathematical and Computer Modelling: An International Journal
Checkpoint scheduling model for optimality

Information Processing Letters
A Cost-Effective Mechanism for Cloud Data Reliability Management Based on Proactive Replica Checking

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Design and modeling of a non-blocking checkpointing system

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
On the checkpointing strategy in desktop grids

IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems
When is multi-version checkpointing needed?

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Performance comparison under failures of MPI and MapReduce: An analytical approach

Future Generation Computer Systems
Optimization of cloud task processing with checkpoint-restart mechanism

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A 'cool' way of improving the reliability of HPC machines

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
A policy-based approach for strong mobility of composed Web services

Service Oriented Computing and Applications
Checkpointing algorithms and fault prediction

Journal of Parallel and Distributed Computing
Automatic identification of application I/O signatures from noisy server-side traces

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies

Quantified Score

Hi-index: 48.29

Abstract

To avoid having to restart a job from the beginning after a random failure, it is standard practice to periodically save enough information for the job to be restarted from the most recent point at which information was saved. Such points are referred to as checkpoints, and saving the information at these points is called checkpointing [1].
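
The abstract does not state the approximation itself. The result usually attributed to this paper is T_c ≈ sqrt(2 · T_s · T_f), where T_s is the time to save one checkpoint and T_f is the mean time between failures. A minimal sketch of that estimate follows, assuming randomly occurring failures and a save time much smaller than the mean time between failures. The function name, parameter names, and sample numbers are illustrative and do not come from the paper.

```python
import math


def young_interval(save_time: float, mtbf: float) -> float:
    """First-order estimate of the compute time between checkpoints.

    save_time: time to write one checkpoint (T_s), in seconds.
    mtbf:      mean time between failures (T_f), in seconds.
    Assumes save_time is much smaller than mtbf.
    """
    if save_time <= 0 or mtbf <= 0:
        raise ValueError("save_time and mtbf must be positive")
    return math.sqrt(2.0 * save_time * mtbf)


# Hypothetical numbers: 5 minutes per checkpoint, one failure per day on average.
interval = young_interval(save_time=300.0, mtbf=86_400.0)
print(f"checkpoint roughly every {interval / 60:.0f} minutes")  # 120 minutes
```

The square root reflects the trade-off: checkpointing more often wastes time on saves, while checkpointing less often loses more work per failure.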