PipeCloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery

Authors:
Timothy Wood;H. Andrés Lagar-Cavilla;K. K. Ramakrishnan;Prashant Shenoy;Jacobus Van der Merwe
Affiliations:
The George Washington University;AT&T Labs -- Research;AT&T Labs -- Research;University of Massachusetts Amherst;AT&T Labs -- Research
Venue:
Proceedings of the 2nd ACM Symposium on Cloud Computing
Year:
2011

Abstract

Disaster Recovery (DR) is a desirable feature for all enterprises, and a crucial one for many. However, adoption of DR remains limited due to the stark tradeoffs it imposes. To recover an application to the point of crash, one is limited by financial considerations, substantial application overhead, or minimal geographical separation between the primary and recovery sites. In this paper, we argue for cloud-based DR and pipelined synchronous replication as an antidote to these problems. Cloud hosting promises economies of scale and on-demand provisioning that are a perfect fit for the infrequent yet urgent needs of DR. Pipelined synchrony addresses the impact of WAN replication latency on performance by efficiently overlapping replication with application processing for multi-tier servers. By tracking the consequences of the disk modifications that are persisted to a recovery site all the way to client-directed messages, applications realize forward progress while retaining full consistency guarantees for client-visible state in the event of a disaster. PipeCloud, our prototype, is able to sustain these guarantees for multi-node servers composed of black-box VMs, with no need for application modification, resulting in a perfect fit for the arbitrary nature of VM-based cloud hosting. We demonstrate disaster failover to the Amazon EC2 platform, and show that PipeCloud can increase throughput by an order of magnitude and reduce response times by more than half compared to synchronous replication, all while providing the same zero-data-loss consistency guarantees.
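
The core idea of pipelined synchrony, as the abstract describes it, can be illustrated with a small toy model. The sketch below is not PipeCloud's implementation: it assumes a single server, a simple write counter, and cumulative acknowledgements from the recovery site, and every name in it (`PipelinedReplicator`, `write`, `send_reply`, `on_ack`) is hypothetical. It shows only the central rule: the application keeps running after issuing a write, but any client-directed message is held back until every write it could depend on has been acknowledged by the recovery site.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PipelinedReplicator:
    """Toy model (hypothetical, single node): buffer replies until writes are acked."""
    issued: int = 0   # number of disk writes sent towards the recovery site
    acked: int = 0    # highest write number the recovery site has confirmed
    pending: deque = field(default_factory=deque)  # (required_ack, reply), FIFO

    def write(self, data: str) -> int:
        """Issue a disk write; the caller continues without waiting for the ack.

        `data` is ignored here: the model only tracks ordering, not contents.
        """
        self.issued += 1
        return self.issued

    def send_reply(self, reply: str) -> list:
        """Queue a client-directed message, tagged with all writes issued so far."""
        self.pending.append((self.issued, reply))
        return self._release()

    def on_ack(self, write_id: int) -> list:
        """Record a cumulative ack from the recovery site; return releasable replies."""
        self.acked = max(self.acked, write_id)
        return self._release()

    def _release(self) -> list:
        """Release, in order, every reply whose dependencies are all acked."""
        released = []
        while self.pending and self.pending[0][0] <= self.acked:
            released.append(self.pending.popleft()[1])
        return released


# Usage: two writes are issued back to back, then a reply is generated.
r = PipelinedReplicator()
r.write("row 1")
r.write("row 2")
held = r.send_reply("OK: rows 1-2")   # held: write 2 is not yet acked
after_first = r.on_ack(1)             # still held: reply depends on write 2
after_second = r.on_ack(2)            # now safe to show the client
```

In this model a plain synchronous scheme would block inside `write` for a full WAN round trip each time; here the round trips overlap with application processing, and the client still never sees a reply that reflects unreplicated state. Tracking dependencies across multiple tiers, as the abstract describes for multi-node servers, would need more than a single counter.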