Fault tolerance in a distributed CHORUS/MiX system

Authors:
Sunil Kittur (Online Media, Cambridge, UK and ICL High Performance Systems, Manchester, UK)
Douglas Steel (ICL High Performance Systems, Manchester, UK)
Francois Armand (Chorus Systems, Saint-Quentin-En-Yvelines, France)
Jim Lipkis (Chorus Systems, Saint-Quentin-En-Yvelines, France)
Venue:
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Year:
1996

Abstract

Within a distributed system, resources may be shared between nodes. The system should continue to operate even if individual nodes fail due to hardware or software errors. A node failure may result in the loss of resources that were hosted on the failed node, but it may be possible to continue to provide access to some of those resources by hosting them on another node. This paper describes mechanisms that allow the failover of resources from failed nodes. Failover is currently restricted to disk volumes and file systems. The failover mechanisms maintain the correct semantics at the UNIX system call level for operations from surviving nodes that were in progress at the time of the failure, including non-idempotent operations. The mechanisms impose minimal resource and performance overheads in the normal running case; in contrast to replication techniques, state is recovered and rebuilt at the time of a failover.