DMTCP: Transparent checkpointing for cluster computations and the desktop

Authors:
Jason Ansel;Kapil Arya;Gene Cooperman
Affiliations:
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA;College of Computer and Information Science, Northeastern University, Boston, MA, USA;College of Computer and Information Science, Northeastern University, Boston, MA, USA
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Cited 15

Pools of virtual boxes: building campus grids with virtual machines
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

Checkpointing and migration of communication channels in heterogeneous grid environments
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I

Model checking distributed systems by combining caching and process checkpointing
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering

Independent checkpointing in a heterogeneous grid environment
Future Generation Computer Systems

Operating system support for redundant multithreading
Proceedings of the tenth ACM international conference on Embedded software

McrEngine: a scalable checkpointing system using data-aware aggregation and compression
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Adapting MPI to MapReduce PaaS Clouds: An Experiment in Cross-Paradigm Execution
UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing

Calculation of the subgroups of a trivial-fitting group
Proceedings of the 38th International Symposium on Symbolic and Algebraic Computation

Chronicler: lightweight recording to reproduce field failures
Proceedings of the 2013 International Conference on Software Engineering

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
The Journal of Supercomputing

Semi-automated debugging via binary search through a process lifetime
Proceedings of the Seventh Workshop on Programming Languages and Operating Systems

A framework for an in-depth comparison of scale-up and scale-out
DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems

McrEngine: A scalable checkpointing system using data-aware aggregation and compression
Scientific Programming - Selected Papers from Super Computing 2012

Abstract

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpoint and restart are demonstrated for more than 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS, which is used for the CMS experiment of the Large Hadron Collider at CERN, runs as a 680 MB image in memory that includes 540 dynamic libraries. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. Because the approach is unprivileged and runs entirely in user space, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP does not require special kernel modules or kernel patches, it can be incorporated and distributed as a checkpoint-restart module within a larger package.
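
The forked checkpointing mentioned in the abstract can be illustrated with a minimal sketch. This is not DMTCP's implementation, which checkpoints whole process images transparently. Here it is assumed that the application state is an explicit, picklable Python object and that the platform is POSIX (for `os.fork`). The names `forked_checkpoint`, `wait_for_checkpoint`, `restart`, and `demo` are hypothetical. The idea shown is only that a forked child holds a copy-on-write snapshot of memory, so the parent can resume work while the child writes the checkpoint to disk.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: holds a copy-on-write snapshot of the parent's memory
        # as it was at the moment of the fork.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # atomic rename: no partial checkpoint file
            status = 0
        finally:
            os._exit(status)  # exit immediately, skipping inherited cleanup
    # Parent: returns right away and keeps computing.
    return pid


def wait_for_checkpoint(pid):
    """Block until the checkpoint child exits; True if it succeeded."""
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def restart(path):
    """Load the saved state back from a checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    """Checkpoint, mutate the live state, then restore the snapshot."""
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    state = {"step": 1, "data": [1, 2, 3]}
    pid = forked_checkpoint(state, path)
    state["step"] = 2  # the parent moves on; the child's snapshot is unaffected
    ok = wait_for_checkpoint(pid)
    return ok, state, restart(path)
```

In this sketch the parent is paused only for the duration of the `fork` call, not for the disk write, which is the general reason a forked checkpoint can appear faster to the application than a blocking one.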