Subtleties in tolerating correlated failures in wide-area storage systems

Authors:
Suman Nath; Haifeng Yu; Phillip B. Gibbons; Srinivasan Seshan
Affiliations:
Microsoft Research; Intel Research Pittsburgh; Intel Research Pittsburgh; Carnegie Mellon University
Venue:
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Year:
2006


Abstract

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today's wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IrisStore, a distributed read-write storage layer that provides high availability. Our results using IrisStore on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.