Publishing: a reliable broadcast communication mechanism

Authors:
Michael L. Powell;David L. Presotto
Affiliations:
Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA;Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA
Venue:
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Year:
1983

Abstract

Publishing is a model and mechanism for crash recovery in a distributed computing environment. Published communication works for systems connected via a broadcast medium by recording every message transmitted over the network. The recovery mechanism can be completely transparent to the failed process and to all processes interacting with it. Although published communication is intended for a broadcast network such as a bus, a ring, or an Ethernet, it can be used in other environments. A recorder reliably stores all transmitted messages, as well as checkpoint and recovery information. When it detects a failure, the recorder may restart the affected processes from their checkpoints. It then resends to each restarted process all messages that were sent to it since its checkpoint was taken, while ignoring duplicate messages sent by that process. Message-based systems without shared memory can use published communication to recover groups of processes. Simulations show that a single recorder can support at least five multi-user minicomputers on a standard Ethernet. The prototype implemented in DEMOS/MP demonstrates that error recovery can be transparent to user processes and can be centralized in the network.
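
The recovery cycle described in the abstract (record every message, checkpoint, restart, replay, suppress duplicates) can be illustrated with a small in-memory sketch. This is not the DEMOS/MP implementation. The names `Recorder`, `Adder`, and `run_demo` are hypothetical, the "network" is an ordinary Python list, and the sketch assumes that processes are deterministic, so replaying the same inputs regenerates the same outputs. Suppressing duplicates by counting the messages the process had already sent is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Toy recorder (hypothetical): logs every message and one checkpoint per process."""
    log: list = field(default_factory=list)          # entries: (sender, receiver, payload)
    checkpoints: dict = field(default_factory=dict)  # pid -> (state, log position)

    def publish(self, sender, receiver, payload):
        # Every message on the simulated broadcast medium is recorded.
        self.log.append((sender, receiver, payload))

    def checkpoint(self, pid, state):
        # Store the state together with how much of the log preceded it.
        self.checkpoints[pid] = (state, len(self.log))

    def recover(self, pid):
        # Return the checkpointed state, the inputs to replay, and the number
        # of outputs the process had already sent after the checkpoint.
        state, pos = self.checkpoints[pid]
        since = self.log[pos:]
        replay = [p for (s, r, p) in since if r == pid]
        already_sent = sum(1 for (s, r, p) in since if s == pid)
        return state, replay, already_sent


class Adder:
    """Deterministic process: adds each input to a total and sends the total on."""

    def __init__(self, pid, recorder, total=0, suppress=0):
        self.pid, self.recorder = pid, recorder
        self.total = total
        self.suppress = suppress  # outputs to drop as duplicates during replay

    def receive(self, value):
        self.total += value
        if self.suppress > 0:
            self.suppress -= 1  # this output was already published before the crash
        else:
            self.recorder.publish(self.pid, "sink", self.total)


def run_demo():
    rec = Recorder()
    proc = Adder("adder", rec)
    for v in (1, 2):
        rec.publish("src", "adder", v)
        proc.receive(v)
    rec.checkpoint("adder", proc.total)
    for v in (4, 5):
        rec.publish("src", "adder", v)
        proc.receive(v)
    before_crash, log_len = proc.total, len(rec.log)

    # Crash: discard the process, then rebuild it from the recorder alone.
    state, replay, already_sent = rec.recover("adder")
    proc = Adder("adder", rec, total=state, suppress=already_sent)
    for v in replay:
        proc.receive(v)
    return before_crash, proc.total, log_len, len(rec.log)
```

Calling `run_demo()` restarts the process from its checkpoint, replays the two inputs logged after it, and leaves the log unchanged, because the regenerated outputs are dropped as duplicates. From the sink's point of view the crash is invisible, which is the transparency property the abstract claims.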