Subtleties in tolerating correlated failures in wide-area storage systems

  • Authors:
  • Suman Nath;Haifeng Yu;Phillip B. Gibbons;Srinivasan Seshan

  • Affiliations:
  • Microsoft Research;Intel Research Pittsburgh;Intel Research Pittsburgh;Carnegie Mellon University

  • Venue:
  • NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
  • Year:
  • 2006

Abstract

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today's wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IrisStore, a distributed read-write storage layer that provides high availability. Our results using IrisStore on PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.