Availability in the Sprite distributed file system

Authors:
Mary Baker;John Ousterhout
Affiliations:
Computer Science Division, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA;Computer Science Division, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA
Venue:
ACM SIGOPS Operating Systems Review
Year:
1991

Abstract

In the Sprite environment, tolerating faults means recovering from them quickly. Our position is that performance and availability are the desired features of the typical locally-distributed office/engineering environment, and that very fast server recovery is the most cost-effective way of providing such availability. Mechanisms used for reliability can be inappropriate in systems with the primary goal of performance, and some availability-oriented methods using replicated hardware or processes cost too much for these systems. In contrast, availability via fast recovery need not slow down a system, and our experience in Sprite shows that in some cases the same techniques that provide high performance also provide fast recovery. In our first attempt to reduce file server recovery times to less than 90 seconds, we take advantage of the distributed state already present in our file system, and a high-performance log-structured file system currently under implementation. As a long-term goal, we hope to reduce recovery to 10 seconds or less.