Composable reliability for asynchronous systems

  • Authors:
  • Sunghwan Yoo;Charles Killian;Terence Kelly;Hyoun Kyu Cho;Steven Plite

  • Affiliations:
  • Purdue University and HP Labs;Purdue University;HP Labs;HP Labs and University of Michigan;Purdue University

  • Venue:
  • USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Distributed systems designs often employ replication to solve two different kinds of availability problems. First, to prevent the loss of data through the permanent destruction or disconnection of a distributed node, and second, to allow prompt retrieval of data when some distributed nodes respond slowly. For simplicity, many systems further handle crash-restart failures and timeouts by treating them as a permanent disconnection followed by the birth of a new node, relying on peer replication rather than persistent storage to preserve data. We posit that for applications deployed in modern managed infrastructures, delays are typically transient and failed processes and machines are likely to be restarted promptly, so it is often desirable to resume crashed processes from persistent checkpoints. In this paper we present MaceKen, a synthesis of complementary techniques including Ken, a lightweight and decentralized rollback-recovery protocol that transparently masks crash-restart failures by careful handling of messages and state checkpoints; and Mace, a programming toolkit supporting development of distributed applications and application-specific availability via replication. MaceKen requires near-zero additional developer effort--systems implemented in Mace can immediately benefit from the Ken protocol by virtue of following the Mace execution model. Moreover, this model allows multiple, independently developed application components to be seamlessly composed, preserving strong global reliability guarantees. Our implementation is available as open source software.