Composable reliability for asynchronous systems

Authors:
Sunghwan Yoo;Charles Killian;Terence Kelly;Hyoun Kyu Cho;Steven Plite
Affiliations:
Purdue University and HP Labs;Purdue University;HP Labs;HP Labs and University of Michigan;Purdue University
Venue:
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Year:
2012

Abstract

Distributed system designs often employ replication to solve two different kinds of availability problems: first, to prevent the loss of data through the permanent destruction or disconnection of a distributed node, and second, to allow prompt retrieval of data when some distributed nodes respond slowly. For simplicity, many systems further handle crash-restart failures and timeouts by treating them as a permanent disconnection followed by the birth of a new node, relying on peer replication rather than persistent storage to preserve data. We posit that for applications deployed in modern managed infrastructures, delays are typically transient and failed processes and machines are likely to be restarted promptly, so it is often desirable to resume crashed processes from persistent checkpoints. In this paper we present MaceKen, a synthesis of complementary techniques including Ken, a lightweight and decentralized rollback-recovery protocol that transparently masks crash-restart failures by careful handling of messages and state checkpoints; and Mace, a programming toolkit supporting development of distributed applications and application-specific availability via replication. MaceKen requires near-zero additional developer effort: systems implemented in Mace can immediately benefit from the Ken protocol by virtue of following the Mace execution model. Moreover, this model allows multiple, independently developed application components to be seamlessly composed while preserving strong global reliability guarantees. Our implementation is available as open-source software.
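
The abstract does not spell out how messages and checkpoints are handled, so the following is only a minimal sketch of one common way to mask crash-restart failures in an event-driven node. It assumes that outputs are buffered during each handler invocation, that state and outputs are committed together at the end of the invocation, that unacknowledged messages are retransmitted, and that duplicates are filtered by message id. Every name here (`Node`, `deliver`, `recover`, `count`, the `disk` dictionary) is hypothetical and is not the Ken or Mace API.

```python
import copy


class Node:
    """Toy event-driven node: one handler call ("turn") per delivered message."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        # Stand-in for durable storage (assumption): only this dict survives crash().
        self.disk = {"state": {}, "outbox": [], "seen": set(), "next_seq": 0}
        self.recover()

    def recover(self):
        """Restart from the last checkpoint, discarding any partial turn."""
        self.state = copy.deepcopy(self.disk["state"])

    def crash(self):
        """Lose all volatile memory."""
        self.state = None

    def deliver(self, msg_id, payload, crash_before_commit=False):
        """Run one turn; return True (an ack) only once the turn is durable."""
        if msg_id in self.disk["seen"]:
            return True  # duplicate of a committed message: re-ack, do not re-run
        pending = []  # outputs are buffered, not sent, during the turn
        self.handler(self.state, payload, pending.append)
        if crash_before_commit:
            self.crash()
            return False  # no ack, so the sender is expected to retransmit
        # Commit: state, outputs, and the dedup record are replaced in one step.
        seq = self.disk["next_seq"]
        new_out = [((self.name, seq + i), p) for i, p in enumerate(pending)]
        self.disk = {
            "state": copy.deepcopy(self.state),
            "outbox": self.disk["outbox"] + new_out,
            "seen": self.disk["seen"] | {msg_id},
            "next_seq": seq + len(pending),
        }
        return True


def count(state, payload, send):
    """Hypothetical application handler: keep a running total and report it."""
    state["total"] = state.get("total", 0) + payload
    send(("total", state["total"]))


node = Node("a", count)
first_ack = node.deliver(("client", 0), 5, crash_before_commit=True)  # crash mid-turn
node.recover()                                  # restart from the last checkpoint
retry_ack = node.deliver(("client", 0), 5)      # sender retransmits the same message
dup_ack = node.deliver(("client", 0), 5)        # a late duplicate is acked but ignored
```

In this sketch the crashed turn leaves no trace: the retransmitted message is processed once, the duplicate is filtered, and the committed outbox holds a single outgoing message, so a peer cannot distinguish the crash from a slow response.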