Recovery management in QuickSilver

  • Authors:
  • R. Haskin;Y. Malachi;W. Sawdon;G. Chan

  • Affiliations:
  • IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA

  • Venue:
  • SOSP '87 Proceedings of the Eleventh ACM Symposium on Operating Systems Principles
  • Year:
  • 1987

Abstract

One price of extensibility and distribution, as implemented in QuickSilver, is a more complicated set of failure modes, and the consequent necessity of dealing with them. In traditional operating systems, services (e.g., file, display) are intrinsic pieces of the kernel. Process state is maintained in kernel tables, and the kernel contains explicit cleanup code (e.g., to close files, reclaim memory, and get rid of process images after hardware or software failures). QuickSilver, however, is structured according to the client-server model, and as in many systems of its type, system services are implemented by user-level processes that maintain a substantial amount of client process state. Examples of this state are the open files, screen windows, address space, etc., belonging to a process. Failure resilience in such an environment requires that clients and servers be aware of problems involving each other. Examples of the way one would like the system to behave include having files closed and windows removed from the screen when a client terminates, and having clients see bad return codes (rather than hanging) when a file server crashes. This motivates a number of design goals:

  • Properly written programs (especially servers) should be resilient to external process and machine failures, and should be able to recover all resources associated with failed entities.
  • Server processes should contain their own recovery code. The kernel should not make any distinction between system service processes and normal application processes.
  • To avoid the proliferation of ad-hoc recovery mechanisms, there should be a uniform system-wide architecture for recovery management.
  • A client may invoke several independent servers to perform a set of logically related activities (a unit of work) that must execute atomically in the presence of failures; that is, either all the related activities should occur or none of them should. The recovery mechanism should support this.

In QuickSilver, recovery is based on the database notion of atomic transactions, which are made available as a system service to be used by other, higher-level servers. This allows meeting all the above design goals. Software portability is important in the QuickSilver environment, dictating that transaction-based recovery be accessible to conventional programming languages rather than a special-purpose one such as Argus [Liskov84]. To accommodate servers with diverse recovery demands, the low-level primitives of commit coordination and log recovery are exposed directly rather than building recovery on top of a stable-storage mechanism such as in CPR [Attanasio87] or recoverable objects such as those in Camelot [Spector87] or Clouds [Allchin&McKendry83].
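
The all-or-nothing "unit of work" spanning several servers can be illustrated with a minimal sketch of generic two-phase commit coordination. This is not QuickSilver's actual interface or protocol: the `Server` class, its `prepare`/`commit`/`abort` methods, and `run_unit_of_work` are hypothetical names introduced only for illustration, and real commit coordination would also write its log to stable storage and handle coordinator failure.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical participant in a unit of work (not a QuickSilver API)."""
    name: str
    can_commit: bool = True          # assumption: stands in for a real vote
    state: str = "idle"
    log: list = field(default_factory=list)  # in-memory stand-in for a recovery log

    def prepare(self, tid):
        # Phase 1: record and return this server's vote for transaction tid.
        self.state = "prepared" if self.can_commit else "refused"
        self.log.append((tid, self.state))
        return self.can_commit

    def commit(self, tid):
        # Phase 2 (success): make this server's part of the work permanent.
        self.state = "committed"
        self.log.append((tid, "committed"))

    def abort(self, tid):
        # Phase 2 (failure): undo work and release the client's resources.
        self.state = "aborted"
        self.log.append((tid, "aborted"))


def run_unit_of_work(tid, servers):
    """Commit on every server, or abort on every server."""
    votes = [s.prepare(tid) for s in servers]   # ask every participant
    if all(votes):
        for s in servers:
            s.commit(tid)
        return "committed"
    for s in servers:
        s.abort(tid)
    return "aborted"
```

In this sketch a single refusing participant, for example `Server("window", can_commit=False)`, causes every server in the unit of work to abort, which is the behavior the fourth design goal asks for.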