Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

Authors:
Pierre Riteau; Adrien Lebre; Christine Morin
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009

Abstract

Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate, which makes fault tolerance mechanisms such as process checkpoint/restart a requirement for exploiting clusters effectively. Most process checkpoint/restart implementations handle only volatile state and ignore the persistent state of applications, such as file contents, which can lead to inconsistent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be integrated with a large number of file systems. To avoid the performance issues of stable storage that relies on synchronous replication, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show that our proposal has a negligible performance impact.
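
The restart problem described in the abstract can be illustrated with a small sketch. It is not the paper's mechanism: the names `run` and `demo` are hypothetical, the volatile state is reduced to a single counter, and the persistent state is saved by naively copying the whole file, which is only an assumption made to keep the example short. It shows that restoring volatile state alone replays writes that are already on disk, while restoring the file together with the process state yields a consistent restart.

```python
import os
import pickle
import shutil
import tempfile


def run(state, log_path, steps):
    """Append one line per step; state["next"] is the volatile state."""
    for _ in range(steps):
        with open(log_path, "a") as f:
            f.write(f"step {state['next']}\n")
        state["next"] += 1


def demo(checkpoint_files):
    """Simulate checkpoint, failure and restart; return the final file lines.

    checkpoint_files=False: only the volatile state is saved and restored.
    checkpoint_files=True: the file is also saved (naive full copy, an
    assumption of this sketch) and rolled back on restart.
    """
    workdir = tempfile.mkdtemp()
    log_path = os.path.join(workdir, "out.log")
    snap_path = os.path.join(workdir, "out.log.ckpt")
    try:
        state = {"next": 0}
        run(state, log_path, 2)          # steps 0 and 1

        # Checkpoint: the volatile state is always saved.
        saved = pickle.dumps(state)
        if checkpoint_files:
            shutil.copyfile(log_path, snap_path)

        run(state, log_path, 2)          # steps 2 and 3, then the node fails

        # Restart: restore the volatile state from the checkpoint.
        state = pickle.loads(saved)
        if checkpoint_files:
            shutil.copyfile(snap_path, log_path)  # roll the file back too

        run(state, log_path, 2)          # re-executes steps 2 and 3
        with open(log_path) as f:
            return f.read().splitlines()
    finally:
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(demo(checkpoint_files=False))  # steps 2 and 3 appear twice
    print(demo(checkpoint_files=True))   # each step appears once
```

Copying the entire file at every checkpoint is the simplest way to make the point but would be costly for large files; the abstract states that the paper's approach is designed to be efficient, and this sketch makes no claim about how that is achieved.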