Fault tolerance under UNIX

Authors:
Anita Borg;Wolfgang Blau;Wolfgang Graetsch;Ferdinand Herrmann;Wolfgang Oberle
Affiliations:
Digital Equipment Corp., Palo Alto, CA;Tandem Computers GmbH, Frankfurt, W. Germany;Nixdorf Computer GmbH, Paderborn, W. Germany;Nixdorf Computer GmbH, Paderborn, W. Germany;Nixdorf Computer GmbH, Paderborn, W. Germany
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1989

Citing 10
Cited 63

Highly available systems for database applications

ACM Computing Surveys (CSUR)
Optimistic recovery in distributed systems

ACM Transactions on Computer Systems (TOCS)
Recovery Techniques for Database Systems

ACM Computing Surveys (CSUR)
The Recovery Manager of the System R Database Manager

ACM Computing Surveys (CSUR)
Guardians and Actions: Linguistic Support for Robust, Distributed Programs

ACM Transactions on Programming Languages and Systems (TOPLAS)
Byzantine generals in action: implementing fail-stop processors

ACM Transactions on Computer Systems (TOCS)
A NonStop kernel

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
A message system supporting fault tolerance

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Publishing: a reliable broadcast communication mechanism

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Process migration in DEMOS/MP

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles

Phoenix: a safe in-memory file system

Communications of the ACM
Fault-tolerant computing based on Mach

ACM SIGOPS Operating Systems Review
Understanding fault-tolerant distributed systems

Communications of the ACM
An implementation for small databases with high availability

ACM SIGOPS Operating Systems Review
Transparent optimistic rollback recovery

ACM SIGOPS Operating Systems Review
Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system

ACM SIGOPS Operating Systems Review
Stable transactional memories and fault tolerant architectures

ACM SIGOPS Operating Systems Review
Restoring consistent global states of distributed computations

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
An annotated bibliography of dependable distributed computing

ACM SIGOPS Operating Systems Review
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
A highly available scalable ITV system

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
On the relevance of communication costs of rollback-recovery protocols

Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
Progressive Retry for Software Failure Recovery in Message-Passing Applications

IEEE Transactions on Computers
Efficient transparent application recovery in client-server information systems

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Persistent messages in local transactions

PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Support for Software Interrupts in Log-Based Rollback-Recovery

IEEE Transactions on Computers
Fast cluster failover using virtual memory-mapped communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Reconfiguration Models and Algorithms for Stateful Interactive Processes

IEEE Transactions on Software Engineering
Replicated condition monitoring

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Transparent optimistic rollback recovery

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Stable transactional memories and fault tolerant architectures

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
The evolution of Coda

ACM Transactions on Computer Systems (TOCS)
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Run-time adaptation in river

ACM Transactions on Computer Systems (TOCS)
ickp: A Consistent Checkpointer for Multicomputers

IEEE Parallel & Distributed Technology: Systems & Technology
Processor and Memory-Based Checkpoint and Rollback Recovery

Computer
Operation Shipping for Mobile File Systems

IEEE Transactions on Computers
Efficient Rollback-Recovery Technique in Distributed Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments

ICPP '97 Proceedings of the international Conference on Parallel Processing
Fault tolerant matrix operations using checksum and reverse computation

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Supporting nondeterministic execution in fault-tolerant systems

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Why Optimistic Message Logging Has Not Been Used in Telecommunications Systems

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance for Off-the-Shelf Applications and Hardware

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
An algorithm for Supporting Fault Tolerant Objects in Distributed Object-Oriented Operating Systems

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Distributed recovery with K-optimistic logging

Journal of Parallel and Distributed Computing
Improving Logging and Recovery Performance in Phoenix/App

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Recovery guarantees for Internet applications

ACM Transactions on Internet Technology (TOIT)
Finding and preventing run-time error handling mistakes

OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Rx: treating bugs as allergies---a safe method to survive software failures

Proceedings of the twentieth ACM symposium on Operating systems principles
Active Replication of Multithreaded Applications

IEEE Transactions on Parallel and Distributed Systems
Rewind, repair, replay: three R's to dependability

EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Recovering device drivers

ACM Transactions on Computer Systems (TOCS)
Undo for operators: building an undoable e-mail store

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Treating bugs as allergies: a safe method for surviving software failures

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Recovering device drivers

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Automated response using system-call delays

SSYM'00 Proceedings of the 9th conference on USENIX Security Symposium - Volume 9
Operation-based update propagation in a mobile file system

ATEC '99 Proceedings of the annual conference on USENIX Annual Technical Conference
Sweeper: a lightweight end-to-end system for defending against fast worms

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems (TOCS)
Exceptional situations and program reliability

ACM Transactions on Programming Languages and Systems (TOPLAS)
Scalable transparent checkpoint-restart of global address space applications on virtual machines over infiniband

Proceedings of the 6th ACM conference on Computing frontiers
Unstoppable stateful PHP web services

WISE'06 Proceedings of the 7th international conference on Web Information Systems
Research: Designing a system infrastructure for distributed programs

Computer Communications
Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows

Information Systems Frontiers

Quantified Score

Hi-index	0.04

Abstract

The initial design for a distributed, fault-tolerant version of UNIX based on three-way atomic message transmission was presented in an earlier paper [3]. The implementation effort then moved from Auragen Systems to Nixdorf Computer, where it was completed. This paper describes the working system, now known as the TARGON/32.

The original design left open questions in at least two areas: fault tolerance for server processes and recovery after a crash were briefly and inaccurately sketched, and rebackup after recovery was not discussed at all. The fundamental design involving three-way message transmission has remained unchanged. However, in addition to important changes in the implementation, server backup has been redesigned and is now more consistent with that of normal user processes. Recovery and rebackup have been completed in a less centralized and thus more efficient manner than previously envisioned.

In this paper we review important aspects of the original design and note how the implementation differs from our original ideas. We then focus on the backup and recovery for server processes and on the changes and additions in the design and implementation of recovery and rebackup.