Lightweight checkpointing for concurrent ML

Authors:
Lukasz Ziarek;Suresh Jagannathan
Affiliations:
Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu);Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu)
Venue:
Journal of Functional Programming
Year:
2010

Abstract

Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics to safe reexecution in multithreaded code is not obvious, however. For a thread to correctly reexecute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior might result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of the design of stabilizers, and provide a detailed description of their implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overhead of using stabilizers is small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
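
The idea that a reverting thread drags along every thread that witnessed its effects can be illustrated with a small sketch. This is not the paper's algorithm or the stabilizers API; it is a simplified model written in Python for illustration, and the names `Event` and `rollback_points` are hypothetical. It assumes each communication is logged as a (logical time, thread, thread) triple, that communication is synchronous so both parties witness it, and that a thread can be reverted to the state just before any logged event.

```python
from typing import Dict, List, Tuple

# Hypothetical log entry: (logical time, one participant, other participant).
Event = Tuple[int, str, str]


def rollback_points(events: List[Event], faulty: str, since: int) -> Dict[str, int]:
    """Return, for each thread that must revert, the logical time it reverts to.

    `faulty` reverts to time `since`. Any thread that communicated with an
    already-reverting thread at or after that thread's revert time has
    witnessed an effect that is being undone, so it must revert to just
    before that communication.
    """
    revert_to: Dict[str, int] = {faulty: since}
    # Processing in time order means a thread's revert time is always the
    # earliest undone event it took part in, so a single pass is enough.
    for time, a, b in sorted(events):
        if time < since:
            continue  # happened before the region being reexecuted
        for src, dst in ((a, b), (b, a)):
            if src in revert_to and revert_to[src] <= time and dst not in revert_to:
                revert_to[dst] = time
    return revert_to


if __name__ == "__main__":
    log = [(1, "A", "B"), (2, "C", "D"), (4, "A", "B"), (5, "B", "C")]
    print(rollback_points(log, faulty="A", since=3))
```

In this example thread A reverts to time 3, which undoes its communication with B at time 4, which in turn undoes B's communication with C at time 5. Thread D only interacted with C at time 2, before C's revert point, so it is left alone.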