Recovery management in QuickSilver

Authors:
R. Haskin;Y. Malachi;W. Sawdon;G. Chan
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
SOSP '87 Proceedings of the eleventh ACM Symposium on Operating systems principles
Year:
1987

Abstract

One price of extensibility and distribution, as implemented in QuickSilver, is a more complicated set of failure modes, and the consequent necessity of dealing with them. In traditional operating systems, services (e.g., file, display) are intrinsic pieces of the kernel. Process state is maintained in kernel tables, and the kernel contains explicit cleanup code (e.g., to close files, reclaim memory, and get rid of process images after hardware or software failures). QuickSilver, however, is structured according to the client-server model, and as in many systems of its type, system services are implemented by user-level processes that maintain a substantial amount of client process state. Examples of this state are the open files, screen windows, address space, etc., belonging to a process. Failure resilience in such an environment requires that clients and servers be aware of problems involving each other. Examples of the way one would like the system to behave include having files closed and windows removed from the screen when a client terminates, and having clients see bad return codes (rather than hanging) when a file server crashes. This motivates a number of design goals:

Properly written programs (especially servers) should be resilient to external process and machine failures, and should be able to recover all resources associated with failed entities.

Server processes should contain their own recovery code. The kernel should not make any distinction between system service processes and normal application processes.

To avoid the proliferation of ad-hoc recovery mechanisms, there should be a uniform system-wide architecture for recovery management.

A client may invoke several independent servers to perform a set of logically related activities (a unit of work) that must execute atomically in the presence of failures; that is, either all the related activities should occur or none of them should. The recovery mechanism should support this.

In QuickSilver, recovery is based on the database notion of atomic transactions, which are made available as a system service to be used by other, higher-level servers. This makes it possible to meet all of the above design goals. Software portability is important in the QuickSilver environment, dictating that transaction-based recovery be accessible to conventional programming languages rather than to a special-purpose one such as Argus [Liskov84]. To accommodate servers with diverse recovery demands, the low-level primitives of commit coordination and log recovery are exposed directly, rather than building recovery on top of a stable-storage mechanism such as in CPR [Attanasio87] or recoverable objects such as those in Camelot [Spector87] or Clouds [Allchin&McKendry83].