In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, for example by writing it to a filesystem. The checkpoint files can then be used to resume computation after the original process or processes fail, ideally with minimal loss of completed work. A checkpoint can be taken at every level of the system, using techniques that range from special hardware or architectural checkpointing features to modification of the user's source code. This survey discusses the techniques used in application-level checkpointing, with special attention to checkpointing parallel and distributed applications.
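As a rough illustration of the source-code end of that spectrum, the following minimal sketch shows a single-process program that saves its own state to a file and resumes from it after a failure. It is not taken from the survey. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the use of `pickle` with an atomic rename are all assumptions chosen for the example. Checkpointing parallel and distributed applications additionally requires coordination between processes, which this sketch does not attempt.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it into place.

    The rename ensures a crash during the write never leaves a
    half-written checkpoint at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure repeats only the steps taken since the last checkpoint. The checkpoint interval therefore trades the overhead of writing state against the amount of work lost on failure.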