Recent advances in checkpoint/recovery systems

Authors:
Greg Bronevetsky;Rohit Fernandes;Daniel Marques;Keshav Pingali;Paul Stodghill
Affiliations:
Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented checkpointing, by hand, in their applications. One use of checkpointing is to mitigate the effects of interruptions in computational service, both planned and unplanned. In fact, some supercomputing centers expect their users to use checkpointing as a matter of policy. And yet, few centers provide fully automatic checkpointing systems for their high-end production machines. This paper is a status report on our work on the family of C3 systems for (almost) fully automatic checkpointing of scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI, and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system; rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.