Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed programming in Argus
Communications of the ACM
High-Performance Fault-Tolerant VLSI Systems Using Micro Rollback
IEEE Transactions on Computers
Principles of distributed database systems
Distributed, object-based programming systems
ACM Computing Surveys (CSUR)
Concurrency control in advanced database applications
ACM Computing Surveys (CSUR)
Experience with transactions in QuickSilver
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Design and Evaluation of the Rollback Chip: Special Purpose Hardware for Time Warp
IEEE Transactions on Computers
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Distributed system builders are faced with the task of meeting a variety of requirements on the global behaviour of the target system, such as stability, fault-tolerance and failure recovery, concurrency control, commitment, and consistency of replicated data. We call the subset of these requirements relevant to a particular application its coherence constraint. The coherence constraint may be very difficult to enforce.
Existing operating system services do not provide the system builder with an adequate platform for addressing coherence, although some systems address individual aspects of it; for example, Isis [3] addresses the fault-tolerance issue. Even recent developments in micro-kernels such as Mach 3.0 [4] and Chorus [18], which have concentrated on supporting the shared-memory abstraction, still leave the systems builder to bridge a significant gap between OS services and basic coherence requirements. The variety of coherence requirements has given rise to a welter of mechanisms having a familial resemblance yet lacking real conceptual integration [16,17,20]. Consequently, the distributed application programmer treats each requirement in isolation, often resulting in costly solutions which are nevertheless obscure and idiosyncratic.
Such problems have been observed in the context of object-based programming environments such as Argus [13], Clouds [7] and others [6]. They are confirmed by our own experience with a persistent object store transaction mechanism using NFS-oriented file locking [5,15].
This paper describes an approach to distributed coherence enforcement based upon rollback. The approach is optimistic in the sense that violations of coherence are resolved rather than prevented; rollback is the agent of this resolution.
Support for coherence is provided by units of distributed computation called transactions. This transaction mechanism is highly controllable, being designed to support advanced database requirements, involving "non-atomic" transactions, as well as conventional atomic transactions (cf. [19]). The transaction service is underpinned by rollback to provide the synchronisation, supported in turn by stable checkpointing and an integrated IPC protocol.
The approach raises two key issues. The first is the problem of disseminating rollback properly through a distributed system. The second arises because computational progress does not occur monotonically in physical time but along its own virtual time axis, and concerns the interaction of these two time axes.
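To make the idea of optimistic, rollback-based resolution concrete, the following is a minimal single-process sketch in Python. It is not the mechanism described in the paper: the `Process` class, its fields, and the update rule `state * 2 + delta` are all hypothetical, chosen only so that the order of events matters. Checkpoints are kept in memory rather than on stable storage, and the sketch does not address the dissemination of rollback to other processes. It shows only that an event arriving late in physical time can be placed correctly on the virtual time axis by restoring a checkpoint and re-executing.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that checkpoints before every event it applies."""

    state: int = 0
    lvt: int = 0  # local virtual time: timestamp of the latest applied event
    # Each checkpoint is (lvt, state), captured just before an event is applied.
    checkpoints: list = field(default_factory=list)
    # Events applied so far, as (timestamp, delta), in virtual-time order.
    log: list = field(default_factory=list)
    rollbacks: int = 0

    def _apply(self, ts, delta):
        # Save a checkpoint, then apply an order-sensitive update (assumed rule).
        self.checkpoints.append((self.lvt, self.state))
        self.state = self.state * 2 + delta
        self.lvt = ts
        self.log.append((ts, delta))

    def receive(self, ts, delta):
        """Apply an event optimistically; roll back if it arrives late."""
        if ts >= self.lvt:
            self._apply(ts, delta)
            return
        # Late event: undo every applied event with a later timestamp.
        self.rollbacks += 1
        undone = []
        while self.log and self.log[-1][0] > ts:
            undone.append(self.log.pop())
            self.lvt, self.state = self.checkpoints.pop()
        # Re-execute in virtual-time order: the late event, then the undone ones.
        self._apply(ts, delta)
        for old_ts, old_delta in reversed(undone):
            self._apply(old_ts, old_delta)
```

For example, delivering events with timestamps 1, 3, 2 (deltas 1, 3, 2) triggers one rollback and leaves `state` equal to 11, the same value that delivery in timestamp order 1, 2, 3 would produce. Physical arrival order and virtual-time order differ, and rollback reconciles them.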