In this paper, we consider the problem of constructing consistent global checkpoints that contain a given set of checkpoints. We address three important issues related to this problem. First, we define the maximum and minimum consistent global checkpoints containing a set S and give algorithms to construct them. These algorithms are based on reachability analysis on a rollback-dependency graph. Second, we introduce a concept called "rollback-dependency trackability" that enables this analysis to be performed efficiently for a certain class of checkpoint and communication models. We define the least stringent of these models ("FDAS") and put it in context with other models defined in the literature. Of particular significance is a way to use FDAS to provide efficient rollback recovery for applications that do not satisfy perfect piecewise determinism. Finally, we describe several applications of the theorems and algorithms derived in this paper to demonstrate that our approach can unify, generalize, and extend many previous works.
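To make the idea of a maximum consistent global checkpoint containing a set S concrete, here is a minimal Python sketch of rollback propagation on a toy model. It is not the algorithm from the paper: the message encoding, the function name `max_consistent_checkpoint`, and the fixed-point loop (a simple stand-in for a reachability search on a rollback-dependency graph) are all assumptions made for illustration. In this toy model, interval x of a process is the execution between its checkpoints x and x + 1, and a global checkpoint is treated as consistent when no message is sent after the sender's checkpoint but received before the receiver's checkpoint.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: (sender, send_interval, receiver, recv_interval).
# Interval x on a process is the execution between its checkpoints x and x + 1.
Message = Tuple[int, int, int, int]


def max_consistent_checkpoint(
    last_index: List[int],
    messages: List[Message],
    targets: Dict[int, int],
) -> Optional[List[int]]:
    """Return the latest consistent cut containing `targets`, or None.

    last_index[p] is the index of the last checkpoint of process p;
    targets maps a process to the checkpoint index that must be in the cut.
    """
    # Start from the latest checkpoint everywhere, pinned to the targets.
    cut = [targets.get(p, last_index[p]) for p in range(len(last_index))]
    changed = True
    while changed:
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            # Orphan message: sent after the sender's checkpoint,
            # received before the receiver's checkpoint.
            if send_iv >= cut[sender] and recv_iv < cut[receiver]:
                if receiver in targets:
                    # A pinned checkpoint would have to roll back.
                    return None
                # Roll the receiver back to the checkpoint that starts
                # the interval in which the message was received.
                cut[receiver] = recv_iv
                changed = True
    return cut


# Toy computation: three processes, each with checkpoints 0, 1, 2.
msgs = [(0, 1, 1, 1), (1, 1, 2, 0)]
print(max_consistent_checkpoint([2, 2, 2], msgs, {0: 1}))        # [1, 1, 0]
print(max_consistent_checkpoint([2, 2, 2], msgs, {0: 1, 1: 2}))  # None
```

In the first call, pinning checkpoint 1 of process 0 forces process 1 back to checkpoint 1, which in turn forces process 2 back to checkpoint 0. In the second call, the two pinned checkpoints are mutually inconsistent under the stated assumptions, so no consistent global checkpoint contains both.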