The Cost of Recovery in Message Logging Protocols

Authors:
Sriram Rao; Lorenzo Alvisi; Harrick M. Vin
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2000

Abstract

Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex trade-off when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery, e.g., sender-based pessimistic and causal protocols, or incur a substantial overhead during failure-free execution, e.g., receiver-based pessimistic protocols. To address this trade-off, we propose hybrid logging protocols, a new class of orphan-free protocols. We show that hybrid protocols perform within two percent of causal logging during failure-free execution and within two percent of receiver-based logging during recovery.