Copysets: reducing the frequency of data loss in cloud storage

Authors:
Asaf Cidon;Stephen M. Rumble;Ryan Stutsman;Sachin Katti;John Ousterhout;Mendel Rosenblum
Affiliations:
Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University
Venue:
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Year:
2013

Abstract

Random replication is widely used in data center storage systems to prevent data loss. However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. Due to the high fixed cost of each incident of data loss, many data center operators prefer to minimize the frequency of such events at the expense of losing more data in each event. We present Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open-source data center storage systems, HDFS and RAMCloud, and show that it incurs a low overhead on all operations. Such systems require that each node's data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near-optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook's HDFS cluster, it reduces the probability from 22.8% to 0.78%.
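
The abstract does not describe the placement algorithm itself, so the following is only an illustrative sketch of the tradeoff it mentions, not the paper's method. It assumes data is lost when every replica of some chunk sits on a failed node, and it compares two hypothetical placements by Monte Carlo simulation: fully random placement, and a deliberately simplified scheme that restricts replicas to disjoint groups of nodes (so each node's data is scattered across only R - 1 other nodes). All function names, cluster sizes, and parameters are assumptions chosen for the example.

```python
import random
from itertools import combinations


def random_copysets(num_nodes, r, num_chunks, rng):
    """Random placement: each chunk's r replicas go to r distinct random nodes.

    Returns the set of distinct replica sets ("copysets") that end up in use.
    """
    return {frozenset(rng.sample(range(num_nodes), r)) for _ in range(num_chunks)}


def partitioned_copysets(num_nodes, r, rng):
    """Simplified restricted placement (an assumption for illustration):
    shuffle the nodes once and cut them into disjoint groups of r."""
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    return {frozenset(nodes[i:i + r]) for i in range(0, num_nodes - r + 1, r)}


def loss_probability(copysets, num_nodes, r, num_failed, trials, rng):
    """Estimate the chance that a simultaneous failure of num_failed random
    nodes wipes out at least one whole copyset."""
    losses = 0
    for _ in range(trials):
        failed = rng.sample(range(num_nodes), num_failed)
        # Data is lost if any r-subset of the failed nodes is a copyset in use.
        if any(frozenset(c) in copysets for c in combinations(failed, r)):
            losses += 1
    return losses / trials


if __name__ == "__main__":
    rng = random.Random(0)
    n, r = 300, 3
    rand = random_copysets(n, r, num_chunks=20000, rng=rng)
    part = partitioned_copysets(n, r, rng)
    print("distinct copysets:", len(rand), "vs", len(part))
    print("random placement loss:", loss_probability(rand, n, r, 15, 200, rng))
    print("partitioned placement loss:", loss_probability(part, n, r, 15, 200, rng))
```

In this toy setup the random placement uses far more distinct copysets, so a given set of failed nodes is much more likely to contain one of them. The price of the restricted scheme is a smaller scatter width, which is the quantity the abstract says must be traded off against the probability of data loss.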