Communications of the ACM
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Distributed computing: a locality-sensitive approach
Distributed computing: a locality-sensitive approach
Elements of distributed computing
Elements of distributed computing
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Condor-G: A Computation Management Agent for Multi-Institutional Grids
Cluster Computing
A Toolkit for User-Level File Systems
Proceedings of the General Track: 2002 USENIX Annual Technical Conference
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Scalability and accuracy in a large-scale network emulator
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Implementing declarative overlays
Proceedings of the twentieth ACM symposium on Operating systems principles
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Mace: language support for building distributed systems
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Exploring failure transparency and the limits of generic recovery
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Flux: a language for programming high-performance servers
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Protection and communication abstractions for web browsers in MashupOS
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Verifiable functional purity in java
Proceedings of the 15th ACM conference on Computer and communications security
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Finding latent performance bugs in systems implementations
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
CIEL: a universal execution engine for distributed data-flow computing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
InContext: simple parallelism for distributed applications
Proceedings of the 20th international symposium on High performance distributed computing
Practical hardening of crash-tolerant systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Practical hardening of crash-tolerant systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Programming model support for dependable, elastic cloud applications
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
Distributed electronic rights in javascript
ESOP'13 Proceedings of the 22nd European conference on Programming Languages and Systems
Proceedings of the 8th ACM European Conference on Computer Systems
EventWave: programming model and runtime support for tightly-coupled elastic cloud applications
Proceedings of the 4th annual Symposium on Cloud Computing
Hi-index | 0.00 |
Distributed systems designs often employ replication to solve two different kinds of availability problems. First, to prevent the loss of data through the permanent destruction or disconnection of a distributed node, and second, to allow prompt retrieval of data when some distributed nodes respond slowly. For simplicity, many systems further handle crash-restart failures and timeouts by treating them as a permanent disconnection followed by the birth of a new node, relying on peer replication rather than persistent storage to preserve data. We posit that for applications deployed in modern managed infrastructures, delays are typically transient and failed processes and machines are likely to be restarted promptly, so it is often desirable to resume crashed processes from persistent checkpoints. In this paper we present MaceKen, a synthesis of complementary techniques including Ken, a lightweight and decentralized rollback-recovery protocol that transparently masks crash-restart failures by careful handling of messages and state checkpoints; and Mace, a programming toolkit supporting development of distributed applications and application-specific availability via replication. MaceKen requires near-zero additional developer effort--systems implemented in Mace can immediately benefit from the Ken protocol by virtue of following the Mace execution model. Moreover, this model allows multiple, independently developed application components to be seamlessly composed, preserving strong global reliability guarantees. Our implementation is available as open source software.