High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today's wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be applied effectively by incorporating them into IrisStore, a distributed read-write storage layer that provides high availability. Our results from running IrisStore on PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.