Application-Level Checkpointing Techniques for Parallel Programs

Authors:
John Paul Walters; Vipin Chaudhary
Affiliations:
Institute for Scientific Computing, Wayne State University; Department of Computer Science and Engineering, University at Buffalo, The State University of New York
Venue:
ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Year:
2006

Abstract

In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, for example by writing it to a filesystem. The checkpoint files can then be used to resume computation after the original process or processes fail, ideally with minimal loss of completed work. A checkpoint can be taken at every level of the system, using techniques that range from special hardware or architectural checkpointing features to modification of the user's source code. This survey discusses the techniques used in application-level checkpointing, with special attention to checkpointing parallel and distributed applications.
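
The save-and-resume idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the surveyed paper: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, the checkpoint interval, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration. It shows a single sequential process whose own source code decides what state to save (the application-level approach), and it leaves out the coordination a parallel program would need.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path.

    The rename keeps the previous checkpoint intact if the process dies
    partway through writing the new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n, interval=100, fail_at=None):
    """Sum i*i for i in range(n), checkpointing every `interval` steps.

    fail_at (hypothetical, for demonstration) simulates a crash by
    raising at that iteration.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += i * i
        state["next_i"] = i + 1
        if state["next_i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a first call such as `run("ckpt.json", 1000, fail_at=250)` fails, a second call `run("ckpt.json", 1000)` resumes from the checkpoint taken at iteration 200, so only the 50 iterations after that checkpoint are repeated. A shorter interval loses less work on failure but spends more time writing checkpoints.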