Random replication is widely used in data center storage systems to prevent data loss. However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. Because each incident of data loss carries a high fixed cost, many data center operators prefer to minimize the frequency of such events at the expense of losing more data in each event. We present Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open-source data center storage systems, HDFS and RAMCloud, and show that it incurs a low overhead on all operations. Such systems require that each node's data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near-optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook's HDFS cluster, it reduces the probability from 22.8% to 0.78%.
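The tradeoff described above can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not the authors' algorithm or evaluation: it only estimates how the chance of losing every replica of some chunk grows with the number of distinct replica sets ("copysets") in use. The function name, the independence approximation, the assumed number of failed nodes, and both copyset counts are hypothetical choices made for illustration.

```python
from math import comb


def loss_probability(num_nodes, replicas, failed, num_copysets):
    """Approximate P(at least one copyset lies entirely inside the failed set).

    Assumes `failed` nodes are chosen uniformly at random and treats the
    copysets as independent, which is only an approximation.
    """
    # Chance that one particular set of `replicas` nodes is fully failed.
    p_single = comb(failed, replicas) / comb(num_nodes, replicas)
    return 1.0 - (1.0 - p_single) ** num_copysets


# Hypothetical parameters, not taken from the paper's evaluation.
N, R = 5000, 3          # cluster size and replication factor
FAILED = 50             # assumed: 1% of nodes do not come back after an outage
few = N // R            # disjoint groups of R nodes (minimal scatter)
many = 10_000_000       # assumed count of distinct copysets under random placement

p_few = loss_probability(N, R, FAILED, few)
p_many = loss_probability(N, R, FAILED, many)
print(f"{few} copysets:  {p_few:.4%}")
print(f"{many} copysets: {p_many:.4%}")
```

The absolute values depend entirely on the assumed parameters. The point is only the direction of the effect: restricting replicas to fewer distinct copysets lowers the chance that a simultaneous failure wipes out all copies of some chunk, while each node's data is scattered across fewer peers.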