Analysis of a Class of Recovery Procedures
IEEE Transactions on Computers
Optimal checkpointing of real-time tasks
IEEE Transactions on Computers
An Experimental Study to Determine Task Size for Rollback Recovery Systems
IEEE Transactions on Computers
Recovery Point Selection on a Reverse Binary Tree Task Model
IEEE Transactions on Software Engineering
Comparative Analysis of Different Models of Checkpointing and Recovery
IEEE Transactions on Software Engineering
Selecting the checkpoint interval in time warp simulation
PADS '93 Proceedings of the seventh workshop on Parallel and distributed simulation
A case for two-level distributed recovery schemes
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme
IEEE Transactions on Computers
A Case for Two-Level Recovery Schemes
IEEE Transactions on Computers
A practical guide to the design of differential files for recovery of on-line databases
ACM Transactions on Database Systems (TODS)
Optimal policy for batch operations: backup, checkpointing, reorganization, and updating
ACM Transactions on Database Systems (TODS)
On the Optimum Checkpoint Interval
Journal of the ACM (JACM)
Checkpointing strategies for database systems
CSC '87 Proceedings of the 15th annual conference on Computer Science
Fault Tolerant Operating Systems
ACM Computing Surveys (CSUR)
Performance analysis of checkpointing strategies
ACM Transactions on Computer Systems (TOCS)
Optimization criteria for checkpoint placement
Communications of the ACM
Performance of rollback recovery systems under intermittent failures
Communications of the ACM
Analysis of Checkpointing for Real-Time Systems
Real-Time Systems
A Variational Calculus Approach to Optimal Checkpoint Placement
IEEE Transactions on Computers
SIGMOD '78 Proceedings of the 1978 ACM SIGMOD international conference on management of data
Stochastic Models for Performance Analysis of Database Recovery Control
IEEE Transactions on Computers
Performance Evaluation of a Two Level Error Recovery Scheme for Distributed Systems
IWDC '02 Proceedings of the 4th International Workshop on Distributed Computing, Mobile and Wireless Computing
A model of roll-back recovery with multiple checkpoints
ICSE '76 Proceedings of the 2nd international conference on Software engineering
An architecture for fault tolerance in database systems
ACM '80 Proceedings of the ACM 1980 annual conference
An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Fault tolerant high performance computing by a coding approach
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle
IEEE Transactions on Dependable and Secure Computing
Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing
Goal-Directed Reasoning for Specification-Based Data Structure Repair
IEEE Transactions on Software Engineering
Design and Evaluation of a Fault-Tolerant Multiprocessor Using Hardware Recovery Blocks
IEEE Transactions on Computers
Model-based performance evaluation of distributed checkpointing protocols
Performance Evaluation
Performance under failures of high-end computing
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Bristlecone: A Language for Robust Software Systems
ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems
International Journal of Parallel, Emergent and Distributed Systems
Fault-aware scheduling for Bag-of-Tasks applications on Desktop Grids
GRID '06 Proceedings of the 7th IEEE/ACM International Conference on Grid Computing
Numerical computation algorithms for sequential checkpoint placement
Performance Evaluation
Modeling and Analysis of Checkpoint I/O Operations
ASMTA '09 Proceedings of the 16th International Conference on Analytical and Stochastic Modeling Techniques and Applications
Performance under Failures of DAG-based Parallel Computing
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A model for predicting the optimum checkpoint interval for restart dumps
ICCS'03 Proceedings of the 2003 international conference on Computational science
Analysis of a software system with rejuvenation, restoration and checkpointing
ISAS'08 Proceedings of the 5th international conference on Service availability
Journal of Systems and Software
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
A flexible checkpoint/restart model in distributed systems
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Dynamic performance prediction of an adaptive mesh application
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing
Proceedings of the international conference on Supercomputing
Energy-aware checkpoint intervals in error-prone mobile networks
Proceedings of the 6th International Conference on Queueing Theory and Network Applications
An initial approximation to the resource-optimal checkpoint interval
PaCT'11 Proceedings of the 11th international conference on Parallel computing technologies
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Checkpointing strategies for parallel jobs
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
SpotMPI: a framework for auction-based HPC computing using amazon spot instances
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing
ACM SRC poster: SpotMPI: auction-based high performance cloud computing
Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
Fault-Tolerant scheduling for bag-of-tasks grid applications
EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Application monitoring and checkpointing in HPC: looking towards exascale systems
Proceedings of the 50th Annual Southeast Regional Conference
Distributed GraphLab: a framework for machine learning and data mining in the cloud
Proceedings of the VLDB Endowment
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimal backup policy for a database system with incremental and full backups
Mathematical and Computer Modelling: An International Journal
Checkpoint scheduling model for optimality
Information Processing Letters
A Cost-Effective Mechanism for Cloud Data Reliability Management Based on Proactive Replica Checking
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Design and modeling of a non-blocking checkpointing system
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
On the checkpointing strategy in desktop grids
IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems
When is multi-version checkpointing needed?
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Performance comparison under failures of MPI and MapReduce: An analytical approach
Future Generation Computer Systems
Optimization of cloud task processing with checkpoint-restart mechanism
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A 'cool' way of improving the reliability of HPC machines
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
A policy-based approach for strong mobility of composed Web services
Service Oriented Computing and Applications
Checkpointing algorithms and fault prediction
Journal of Parallel and Distributed Computing
Automatic identification of application I/O signatures from noisy server-side traces
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
To avoid restarting a job from the beginning after a random failure, it is standard practice to periodically save enough information for the job to be restarted from the most recent point at which information was saved. Such points are called checkpoints, and saving the information at these points is called checkpointing [1].
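
The idea can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not a method taken from the text: the function name `run_with_checkpoints`, its parameters, and the toy "job" (summing the step indices) are all hypothetical, and failures are injected at fixed attempt numbers rather than drawn at random so that the behaviour is reproducible.

```python
def run_with_checkpoints(total_steps, checkpoint_every, fail_at):
    """Simulate a job of `total_steps` unit steps with periodic checkpoints.

    All names here are illustrative assumptions:
      total_steps      -- number of steps the job must complete
      checkpoint_every -- save the state after every this-many completed steps
      fail_at          -- set of attempt numbers (1-based) at which a failure
                          strikes; the attempt does no useful work and the job
                          rolls back to the last saved state
    Returns (result, attempts), where `attempts` counts every attempted step,
    including failed and repeated ones.
    """
    state = {"step": 0, "acc": 0}   # live state of the job
    checkpoint = dict(state)        # last saved state (initially the start)
    attempts = 0

    while state["step"] < total_steps:
        attempts += 1
        if attempts in fail_at:
            # Failure: discard the live state and restart from the checkpoint.
            state = dict(checkpoint)
            continue

        # One unit of useful work (toy job: accumulate the step index).
        state["acc"] += state["step"]
        state["step"] += 1

        # Checkpointing: save enough information to restart from this point.
        if state["step"] % checkpoint_every == 0:
            checkpoint = dict(state)

    return state["acc"], attempts
```

With ten steps and a single failure on the fifth attempt, `run_with_checkpoints(10, 3, {5})` finishes in 12 attempts, whereas `run_with_checkpoints(10, 100, {5})`, which never reaches a checkpoint, needs 15 because the job restarts from the beginning. Both return the same result, 45. The sketch ignores the cost of taking a checkpoint; in practice that cost is what makes the choice of interval a trade-off.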