Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys (CSUR)
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Monitors, messages, and clusters: the p4 parallel programming system
Parallel Computing - Special issue: message passing interfaces
On the relevance of communication costs of rollback-recovery protocols
Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
Trade-offs in implementing causal message logging protocols
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
MPI: The Complete Reference
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
IEEE Transactions on Software Engineering
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A message system supporting fault tolerance
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
A Non-Blocking Recovery Algorithm for Causal Message Logging
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
How to recover efficiently and asynchronously when optimism fails
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Distributed system fault tolerance using message logging and checkpointing
Distributed system fault tolerance using message logging and checkpointing
Causality tracking in causal message-logging protocols
Distributed Computing
A causal message logging protocol for mobile nodes in mobile computing systems
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
A New Approach for High Performance Computing Systems with Various Checkpointing Schemes
The Journal of Supercomputing
Performance analysis of different checkpointing and recovery schemes using stochastic model
Journal of Parallel and Distributed Computing
Finding a suitable checkpoint and recovery protocol for a distributed application
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
A weighted checkpointing protocol for mobile distributed systems
International Journal of Ad Hoc and Ubiquitous Computing
Agent based dynamic recovery protocol in distributed databases
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Performance evaluation of parallel systems employing roll-forward checkpoint schemes
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
Performance evaluation of consistent recovery protocols using MPICH-GF
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Future Generation Computer Systems
Hi-index | 0.00 |
Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex trade-off when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery, e.g., sender-based pessimistic and causal protocols, or incur a substantial overhead during failure-free execution, e.g., receiver-based pessimistic protocols. To address this trade-off, we propose hybrid logging protocols, a new class of orphan-free protocols. We show that hybrid protocols perform within two percent of causal logging during failure-free execution and within two percent of receiver-based logging during recovery.