Designing loosely coupled distributed computer systems (DCS) that tolerate propagated errors caused by software and/or hardware faults is a technological challenge that has been inadequately dealt with. In this paper, we adopt the view that a truly loosely coupled DCS consists of loosely coupled interacting processes distributed among multiple physical sites, where each process is designed in the 'partitioned design' mode, i.e., designed with knowledge of its own interface specification only, rather than with full knowledge of the interfaces between other processes (or sites). It follows naturally that fault tolerance capabilities must be designed into the loosely coupled processes of such systems without violating the partitioned design policy. The programmer-transparent coordination (PTC) scheme is one such approach and has been evolving since 1978. The basic PTC scheme, called the PTC/OR (PTC with obedient receiver) scheme, facilitates various forms of cooperative backward recovery in systems of loosely coupled processes, but it has one drawback: the difficulty of bounding worst-case recovery time. After discussing various fundamentally different solution approaches and their limitations, we present a promising approach called the PTC/SL (PTC with session leaders) scheme, which superimposes additional rules for structuring process interactions onto those of the PTC/OR scheme. Under the PTC/SL scheme, various flexible forms of process interaction are still allowed, while the task of ensuring bounded recovery time becomes a simple one.