Copysets: Reducing the Frequency of Data Loss in Cloud Storage

  • Authors:
  • Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, Mendel Rosenblum

  • Affiliations:
  • Stanford University (all authors)

  • Venue:
  • Proceedings of the 2013 USENIX Annual Technical Conference (USENIX ATC '13)
  • Year:
  • 2013


Abstract

Random replication is widely used in data center storage systems to prevent data loss. However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. Because each incident of data loss carries a high fixed cost, many data center operators prefer to minimize the frequency of such events, even at the expense of losing more data in each event. We present Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open-source data center storage systems, HDFS and RAMCloud, and show that it incurs low overhead on all operations. Such systems require that each node's data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near-optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook's HDFS cluster, it reduces the probability from 22.8% to 0.78%.
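
The intuition behind these numbers can be reproduced with a small Monte Carlo sketch. The code below is illustrative, not the paper's implementation: it builds copysets with the paper's permutation scheme (P = S/(R-1) random permutations, each split into groups of R nodes, where S is the scatter width and R the replication factor), while all concrete parameters (1,000 nodes, R = 3, S = 4, one million chunks, a 1% outage) are assumptions chosen for a quick demonstration.

```python
import random
from itertools import combinations

def random_replication_copysets(n_nodes, r, n_chunks, rng):
    # Random replication: every chunk's R replicas land on an arbitrary
    # R-subset of nodes, so the number of distinct copysets grows with
    # the number of chunks stored in the cluster.
    return {tuple(sorted(rng.sample(range(n_nodes), r))) for _ in range(n_chunks)}

def copyset_replication_copysets(n_nodes, r, scatter_width, rng):
    # Simplified Copyset Replication: P = S / (R - 1) random permutations,
    # each chopped into consecutive groups of R nodes. Replicas are only
    # ever placed on one of these fixed copysets.
    p = scatter_width // (r - 1)
    copysets = set()
    for _ in range(p):
        perm = rng.sample(range(n_nodes), n_nodes)
        for i in range(0, n_nodes - r + 1, r):
            copysets.add(tuple(sorted(perm[i:i + r])))
    return copysets

def loss_probability(copysets, n_nodes, r, fail_fraction, trials, rng):
    # A power outage kills fail_fraction of the nodes at once; data is
    # lost iff some R-subset of the failed nodes is a copyset in use.
    n_failed = max(r, int(n_nodes * fail_fraction))
    losses = 0
    for _ in range(trials):
        failed = sorted(rng.sample(range(n_nodes), n_failed))
        if any(subset in copysets for subset in combinations(failed, r)):
            losses += 1
    return losses / trials

if __name__ == "__main__":
    rng = random.Random(42)
    N, R, S = 1000, 3, 4  # nodes, replication factor, scatter width (illustrative)
    schemes = {
        "random":  random_replication_copysets(N, R, n_chunks=1_000_000, rng=rng),
        "copyset": copyset_replication_copysets(N, R, S, rng),
    }
    for name, cs in schemes.items():
        p = loss_probability(cs, N, R, fail_fraction=0.01, trials=1000, rng=rng)
        print(f"{name}: {len(cs)} copysets, P(loss | 1% outage) ~ {p:.2%}")
```

With these assumed parameters, random replication creates on the order of a million distinct copysets, so a 1% outage covers at least one of them in a substantial fraction of trials; the permutation scheme uses only a few hundred copysets, driving the per-outage loss probability close to zero at the cost of losing more data on the rare occasions a copyset is hit. This is the same frequency-versus-magnitude tradeoff the abstract quantifies for RAMCloud and HDFS.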