Maintaining availability in the face of failures is a critical requirement for Internet services. Existing approaches to cluster-based data storage rely on redundancy to survive a small number of failures, but the system becomes entirely unavailable if more failures occur. We describe an approach that allows a cluster file server to isolate failures so that the system can continue to serve most clients. Our approach is complementary to existing redundancy-based methods: redundancy masks the first few failures, and failure isolation then takes over to maintain availability for the majority of clients if more failures occur. The building blocks of our design are self-contained, load-balanced file servers called islands. The main idea underlying island-based design is the one-island principle: as many operations as possible should involve exactly one island. The one-island principle provides failure isolation because each island can function independently of failures in other islands. It also helps the file system scale with system and workload size because communication and synchronization across islands are reduced. We implemented a prototype island-based file system called Archipelago on a cluster of PCs running Windows NT 4.0 connected by Ethernet. Microbenchmark measurements show that Archipelago adds little overhead to NTFS and Win32 RPC performance, while measurements of operation mixes based on NTFS traces show a speedup of 15.7 on 16 islands.
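
The one-island principle can be illustrated with a small sketch. The abstract does not say how files are assigned to islands, so the placement rule here (hashing the parent directory path) is an assumption made for illustration, as are the names Island, Cluster, island_for, and IslandUnavailable. The sketch is not Archipelago's implementation; it only shows that when each read or write touches exactly one island, a failed island affects only the files placed on it.

```python
import hashlib
import posixpath


class IslandUnavailable(Exception):
    """Raised when an operation targets an island that has failed."""


class Island:
    """A self-contained file server (hypothetical, in-memory stand-in)."""

    def __init__(self, island_id):
        self.island_id = island_id
        self.alive = True
        self.files = {}  # path -> bytes


class Cluster:
    def __init__(self, num_islands):
        self.islands = [Island(i) for i in range(num_islands)]

    def island_for(self, path):
        # Assumption: all files in one directory live on the same island,
        # chosen by hashing the parent directory path.
        parent = posixpath.dirname(path)
        digest = hashlib.sha256(parent.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.islands)
        return self.islands[index]

    def write(self, path, data):
        # The operation involves exactly one island.
        island = self.island_for(path)
        if not island.alive:
            raise IslandUnavailable(path)
        island.files[path] = data

    def read(self, path):
        # Likewise, a read never consults any other island.
        island = self.island_for(path)
        if not island.alive:
            raise IslandUnavailable(path)
        return island.files[path]
```

Marking one island as failed (alive = False) makes only the paths that hash to it raise IslandUnavailable; reads and writes routed to the remaining islands proceed without contacting the failed one.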