This article presents the design and implementation of the recovery scheme in Calypso, a cluster-optimized distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive: open files remain open, client-modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses state distributed among the clients to reconstruct the server state, either on a backup node (if the disks are multiported) or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that state reconstruction can be quite fast: for example, in a 32-node cluster in which an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when journaling techniques are used.
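The idea of rebuilding server state from state held by the clients can be illustrated with a minimal sketch. This is not Calypso's actual protocol or data layout: the record types `ClientFileState` and `ServerFileState`, the function `reconstruct_server_state`, and the node names are all hypothetical, and the sketch leaves out the consistency guarantees and congestion control mentioned in the abstract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClientFileState:
    """Hypothetical record a client keeps for one file it is using."""
    file_id: int
    open_count: int
    has_dirty_data: bool


@dataclass
class ServerFileState:
    """Hypothetical per-file state that the recovering server rebuilds."""
    openers: dict = field(default_factory=dict)      # client name -> open count
    dirty_holders: set = field(default_factory=set)  # clients with unsaved data


def reconstruct_server_state(reports):
    """Merge per-client reports into a fresh server-side table.

    `reports` maps a client name to the list of ClientFileState records
    that the client holds. Returns a dict of file_id -> ServerFileState.
    """
    table = {}
    for client, records in reports.items():
        for rec in records:
            entry = table.setdefault(rec.file_id, ServerFileState())
            if rec.open_count > 0:
                entry.openers[client] = rec.open_count
            if rec.has_dirty_data:
                entry.dirty_holders.add(client)
    return table


# Illustrative input: two clients report the files they hold state for.
reports = {
    "node-a": [ClientFileState(1, 2, False), ClientFileState(2, 1, True)],
    "node-b": [ClientFileState(1, 1, False)],
}
state = reconstruct_server_state(reports)
```

In this toy model the rebuilt table records that file 1 is open on both nodes and that `node-a` still holds modified data for file 2, which is the kind of information a stateful server would need in order to keep open files open after it restarts.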