Lightweight checkpointing for Concurrent ML

  • Authors:
  • Lukasz Ziarek; Suresh Jagannathan

  • Affiliations:
  • Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu)

  • Venue:
  • Journal of Functional Programming
  • Year:
  • 2010

Abstract

Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics to safe reexecution in multithreaded code is not obvious, however. For a thread to correctly reexecute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of the design of stabilizers and provide a detailed description of their implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads of using stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
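
The idea of computing a consistent global checkpoint from monitored communication events can be illustrated with a small toy model. The sketch below is written in Python rather than CML, and it is not the paper's algorithm or the MLton implementation: the names `Monitor`, `ThreadLog`, `communicate`, and `rollback_targets` are hypothetical, and communication is assumed to be a synchronous rendezvous in which undoing one side's participation forces the other side to undo its participation too.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadLog:
    """Per-thread record (hypothetical): event counter and checkpoint positions."""
    clock: int = 0
    # Counter values at which checkpoints were taken; 0 is the thread's start.
    checkpoints: list = field(default_factory=lambda: [0])


class Monitor:
    """Toy monitor that logs communication and computes rollback targets."""

    def __init__(self):
        self.threads = {}
        self.events = []  # (thread_a, time_a, thread_b, time_b)

    def spawn(self, name):
        self.threads[name] = ThreadLog()

    def step(self, name):
        """Record one local event and return its position in the thread."""
        log = self.threads[name]
        log.clock += 1
        return log.clock

    def checkpoint(self, name):
        """Remember the current position as a state the thread can return to."""
        log = self.threads[name]
        log.checkpoints.append(log.clock)

    def communicate(self, a, b):
        """Record a rendezvous between two threads (assumed synchronous)."""
        self.events.append((a, self.step(a), b, self.step(b)))

    def rollback_targets(self, name):
        """Map each affected thread to the checkpoint it must revert to
        when `name` reverts to its most recent checkpoint."""
        targets = {name: self.threads[name].checkpoints[-1]}
        changed = True
        while changed:  # targets only ever decrease, so this terminates
            changed = False
            for a, ta, b, tb in self.events:
                for x, tx, y, ty in ((a, ta, b, tb), (b, tb, a, ta)):
                    # x's side of the rendezvous is undone by its rollback,
                    # so y must revert to a checkpoint taken before it.
                    if x in targets and tx > targets[x]:
                        cp = max(c for c in self.threads[y].checkpoints if c < ty)
                        if y not in targets or cp < targets[y]:
                            targets[y] = cp
                            changed = True
        return targets


m = Monitor()
for n in ("A", "B", "C"):
    m.spawn(n)
m.step("A")
m.checkpoint("A")        # A checkpoints before talking to B
m.communicate("A", "B")
m.checkpoint("B")        # B checkpoints before talking to C
m.communicate("B", "C")

print(m.rollback_targets("A"))  # A, B and C are all reverted
print(m.rollback_targets("C"))  # only C and B are reverted; A is unaffected
```

In this toy run, reverting A drags B and C back because each transitively witnessed A's effects, whereas reverting C stops at B's checkpoint and leaves A untouched. This is the sense in which thread interactions, which are known only at run time, determine which global states are safe.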