One challenge in applying erasure codes (or error-correcting codes) to distributed storage systems is keeping data and redundancy blocks consistent when servers crash. We present two access protocols that provide sequential consistency and maximum distance separable (MDS) fault tolerance at the same time. Both protocols use sequence numbers to recover a consistent version in the presence of failures or partial writes. The first, pessimistic protocol (PSW) uses one master per stripe to execute updates in sequence. The second, optimistic protocol (OCW) lets concurrent writes to blocks in the same stripe proceed in parallel, at the cost of additional buffer space. We present empirical performance results for PSW and OCW and compare them to other protocols. Our results show that OCW is as fast as simple replication while providing better fault tolerance, reduced storage overhead, or both. This demonstrates that erasure coding can serve as a space-efficient alternative to replication in distributed storage systems.
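The role of sequence numbers in recovery can be illustrated with a minimal Python sketch. It is not the PSW or OCW protocol itself; it only shows the general idea under simplifying assumptions that do not come from the abstract: every write covers the whole stripe, each server reports the set of sequence numbers it still holds (including older, buffered versions), and any `k` blocks of the same version suffice to decode, as with an MDS code. The function name `latest_recoverable_version` and the data layout are hypothetical.

```python
from typing import Optional


def latest_recoverable_version(
    held: list[Optional[set[int]]], k: int
) -> Optional[int]:
    """Return the highest sequence number held by at least k reachable servers.

    `held` has one entry per block server in the stripe (data and redundancy).
    Each entry is the set of sequence numbers that server can still serve,
    or None if the server has crashed. (Hypothetical model, see lead-in.)
    """
    counts: dict[int, int] = {}
    for versions in held:
        if versions is None:  # crashed server contributes nothing
            continue
        for seq in versions:
            counts[seq] = counts.get(seq, 0) + 1
    # With an MDS code, a version is decodable once k of its blocks exist.
    recoverable = [seq for seq, n in counts.items() if n >= k]
    return max(recoverable) if recoverable else None


# Assumed stripe with k = 4 data blocks and 2 redundancy blocks.
# Write 7 reached only three servers before the writer failed;
# one server has crashed.
stripe = [{6, 7}, {6, 7}, {6, 7}, {6}, {6}, None]
print(latest_recoverable_version(stripe, k=4))  # 6
```

In this example version 7 is a partial write held by only three servers, so the stripe falls back to version 6. If the first three servers had discarded version 6 when they accepted version 7, neither version would have four holders and the stripe would be unrecoverable, which is one way to see why retaining older versions costs buffer space.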