A message system supporting fault tolerance

Authors:
Anita Borg; Jim Baumbach; Sam Glazer
Affiliations:
Auragen Systems Corporation, 2 Executive Drive, Fort Lee, New Jersey (all authors)
Venue:
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Year:
1983

Abstract

A simple and general design uses message-based communication to provide software tolerance of single-point hardware failures. Every interprocess message is delivered to inactive backups of both the sender and the destination, which keeps each backup in a state from which it can take over for its primary. An implementation for the Auragen 4000 series of M68000-based systems is described. The operating system, Auros™, is a distributed version of UNIX. The major goals have been transparency of fault tolerance and efficient execution in the absence of failure.
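
The delivery rule in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Auragen implementation. The `Process` and `Backup` classes, the `send` and `take_over` functions, and the running-sum process state are all hypothetical. How a backup uses the messages it is given (replaying saved input, counting its primary's sends) is likewise an assumption made for illustration; the abstract states only that both backups receive every message.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Inactive backup (hypothetical): records traffic but does not execute."""
    inbox: list = field(default_factory=list)  # copies of messages sent to its primary
    sent_count: int = 0                        # number of messages its primary has sent


@dataclass
class Process:
    """Hypothetical primary whose entire state is the sum of its inputs."""
    name: str
    backup: Backup = field(default_factory=Backup)
    total: int = 0

    def receive(self, value: int) -> None:
        self.total += value


def send(sender: Process, dest: Process, value: int) -> None:
    """Deliver one message to the destination and to both backups."""
    dest.receive(value)              # destination primary processes the message
    dest.backup.inbox.append(value)  # destination's backup saves a copy
    sender.backup.sent_count += 1    # sender's backup notes that a send occurred


def take_over(name: str, backup: Backup) -> Process:
    """Assumed recovery step: replay the saved input into a fresh process."""
    replacement = Process(name)
    for value in backup.inbox:
        replacement.receive(value)
    return replacement


a, b = Process("a"), Process("b")
for v in (3, 4, 5):
    send(a, b, v)

# Simulate the failure of b's primary: its backup reconstructs the same state.
recovered_b = take_over("b", b.backup)
print(recovered_b.total == b.total, a.backup.sent_count)
```

In this model the backups do no work during normal operation beyond storing or counting messages, which reflects the stated goal of efficient execution in the absence of failure.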