Recovery in the Calypso file system

Authors:
Murthy Devarakonda;Bill Kish;Ajay Mohindra
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1996

Abstract

This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive, meaning that open files remain open, client-modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses state distributed among the clients to reconstruct the server state, either on a backup node, if the disks are multiported, or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when using journaling techniques.