Message Logging: Pessimistic, Optimistic, Causal, and Optimal

Authors:
Lorenzo Alvisi; Keith Marzullo
Venue:
IEEE Transactions on Software Engineering
Year:
1998

Abstract

Message-logging protocols are an integral part of a popular technique for implementing processes that can recover from crash failures. All message-logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. We give a precise specification of the consistency property "no orphan processes." From this specification, we describe how different existing classes of message-logging protocols (namely optimistic, pessimistic, and a class that we call causal) implement this property. We then propose a set of metrics to evaluate the performance of message-logging protocols, and characterize the protocols that are optimal with respect to these metrics. Finally, starting from a protocol that relies on causal delivery order, we show how to derive optimal causal protocols that tolerate f overlapping failures and recoveries for a parameter f: 1 ≤ f ≤ n.
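
The notion of an orphan process in the abstract can be illustrated with a small sketch. This is a simplified model written for illustration, not the paper's formal specification: the function name `find_orphans`, the dictionaries `depends_on` and `logged_at`, and the marker `"stable"` for stable storage are all hypothetical. The sketch assumes a message becomes unrecoverable when every holder of its log record has crashed, and treats any surviving process whose state depends on such a message as an orphan.

```python
def find_orphans(depends_on, logged_at, crashed):
    """Return surviving processes whose state depends on a lost message.

    depends_on: dict mapping process name -> set of message ids its state depends on
    logged_at:  dict mapping message id -> set of holders of its log record
                (process names, or "stable" for stable storage; hypothetical model)
    crashed:    set of crashed process names
    """
    # A message is lost if every holder of its log record has crashed.
    lost = {m for m, holders in logged_at.items() if holders <= crashed}
    # An orphan is a survivor that depends on at least one lost message.
    return {
        p for p, msgs in depends_on.items()
        if p not in crashed and msgs & lost
    }


# q delivers m1 and keeps its record only in volatile memory, then sends m2 to r.
# r's state therefore depends on both m1 and m2.
depends_on = {"q": {"m1"}, "r": {"m1", "m2"}}
crashed = {"q"}

# Record of m1 held only by q: when q crashes, r becomes an orphan.
volatile_only = {"m1": {"q"}, "m2": {"r", "stable"}}
orphans_volatile = find_orphans(depends_on, volatile_only, crashed)

# Record of m1 written to stable storage before q proceeds: no orphans.
logged_first = {"m1": {"stable"}, "m2": {"r", "stable"}}
orphans_stable = find_orphans(depends_on, logged_first, crashed)

print(orphans_volatile)  # {'r'}
print(orphans_stable)    # set()
```

In this toy model, the second configuration corresponds loosely to the pessimistic approach named in the abstract, where a record is made stable before any other process can come to depend on it. The first shows the situation that optimistic protocols must detect and repair during recovery.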