On Coordinated Checkpointing in Distributed Systems

Authors:
Guohong Cao;Mukesh Singhal
Affiliations:
Ohio State Univ., Columbus;Ohio State Univ., Columbus
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1998

References:

Optimistic recovery in distributed systems. ACM Transactions on Computer Systems (TOCS).
Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering - Special issue on distributed systems.
On distributed snapshots. Information Processing Letters.
Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System. IEEE Transactions on Software Engineering.
Checkpointing and rollback-recovery algorithms in distributed systems. Journal of Systems and Software - Special issue on fault tolerance in real-time systems.
Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Transactions on Parallel and Distributed Systems.
Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS).
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems. IEEE Transactions on Parallel and Distributed Systems.
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints. IEEE Transactions on Computers.
Time, clocks, and the ordering of events in a distributed system. Communications of the ACM.
Checkpointing distributed applications on mobile computers. PDIS '94 Proceedings of the third international conference on Parallel and distributed information systems.
An Efficient Protocol for Checkpointing Recovery in Distributed Systems. IEEE Transactions on Parallel and Distributed Systems.
Finding Consistent Global Checkpoints in a Distributed Computation. IEEE Transactions on Parallel and Distributed Systems.
Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment. EDCC-1 Proceedings of the First European Dependable Computing Conference on Dependable Computing.
Concurrent Robust Checkpointing and Recovery in Distributed Systems. Proceedings of the Fourth International Conference on Data Engineering.
Experimental Evaluation of Concurrency Checkpointing and Rollback-Recovery Algorithms. Proceedings of the Sixth International Conference on Data Engineering.
Maximum and minimum consistent global checkpoints and their applications. SRDS '95 Proceedings of the 14th Symposium on Reliable Distributed Systems.

Abstract

Coordinated checkpointing simplifies failure recovery and eliminates the domino effect by preserving a consistent global checkpoint on stable storage. However, the approach suffers from the high overhead of the checkpointing process. Two approaches are used to reduce this overhead: one is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were pursued separately until the Prakash-Singhal algorithm [18] combined them: it forces only a minimum number of processes to take checkpoints, and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. We also point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
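
The notion of a consistent global checkpoint that the abstract relies on can be illustrated with a small sketch. This is not the paper's algorithm; it only encodes the standard definition that a set of local checkpoints is consistent when no message is recorded as received without also being recorded as sent (no orphan messages). The `Message` type, the per-process event indices, and the function names are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # sender's local event index when the message was sent
    recv_time: int  # receiver's local event index when it was received


def orphan_messages(checkpoints, messages):
    """Return messages whose receipt is checkpointed but whose send is not.

    `checkpoints` maps each process name to the local event index of its
    checkpoint; events with an index <= that value are part of the checkpoint.
    """
    return [
        m
        for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A global checkpoint is consistent when it contains no orphan message."""
    return not orphan_messages(checkpoints, messages)


messages = [Message("P1", "P2", send_time=3, recv_time=2)]

# P2's checkpoint includes the receive, but P1's checkpoint predates the send.
print(is_consistent({"P1": 2, "P2": 4}, messages))

# Once P1's checkpoint also covers the send, the orphan disappears.
print(is_consistent({"P1": 3, "P2": 4}, messages))
```

Under these assumptions, the first call reports an inconsistent global checkpoint and the second a consistent one. In this simplified picture, a coordinated algorithm is one whose processes exchange messages so that the checkpoints they save never leave such an orphan behind.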