A checkpointing strategy for scalable recovery on distributed parallel systems

Authors:
Vijay K. Naik; Samuel P. Midkiff; Jose E. Moreira
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Year:
1997

Abstract

In this paper, we describe a new scheme for checkpointing parallel applications on scalable, message-passing, distributed-memory systems. The novelty of our scheme is that a checkpointed application can be restored from its checkpointed state in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t1 tasks on p1 processors and then restarted from the checkpointed state with t2 tasks on p2 processors. As a result, applications can recover from partial failures in the underlying system. In addition, the reconfigurable checkpointed states can be migrated from one parallel system to another, even if the two systems do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This model is derived from the DRMS programming model, which was developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distribution-independent representation of application array data structures in persistent storage. To further optimize the performance of checkpoint/restart operations, we provide parallel array section streaming operations for such distributed arrays. We present performance data for the reconfigurable checkpointing and restarting of parallel applications and compare it with the performance of conventional forms of checkpointing. Our results demonstrate the advantages of the new scheme.
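
The idea of a distribution-independent checkpoint can be illustrated with a small sketch. It is not the DRMS interface or the authors' implementation: the function names (`block_bounds`, `checkpoint`, `restart`, `demo`), the one-dimensional block distribution, and the raw little-endian file layout are all assumptions made for illustration, and serial loops stand in for the parallel tasks. Each task writes its array section at the section's global offset, so the file stores the array in global index order and carries no trace of how many tasks wrote it. A restart with a different task count then simply reads different sections of the same file.

```python
import os
import struct
import tempfile

ITEM = 8  # bytes per float64 element


def block_bounds(n, tasks, rank):
    """Global [start, stop) range owned by `rank` under a block distribution."""
    base, extra = divmod(n, tasks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def checkpoint(path, local_sections, n):
    """Write every task's section at its global offset (hypothetical helper).

    The resulting file holds the array in global order, independent of
    the number of tasks that produced it.
    """
    tasks = len(local_sections)
    with open(path, "wb") as f:
        for rank, section in enumerate(local_sections):
            start, stop = block_bounds(n, tasks, rank)
            f.seek(start * ITEM)
            f.write(struct.pack(f"<{stop - start}d", *section))


def restart(path, n, tasks):
    """Read the array back as sections for a possibly different task count."""
    sections = []
    with open(path, "rb") as f:
        for rank in range(tasks):
            start, stop = block_bounds(n, tasks, rank)
            f.seek(start * ITEM)
            raw = f.read((stop - start) * ITEM)
            sections.append(list(struct.unpack(f"<{stop - start}d", raw)))
    return sections


def demo():
    """Checkpoint a 10-element array with t1 = 4 tasks, restart with t2 = 3."""
    n, t1, t2 = 10, 4, 3
    global_array = [float(i) for i in range(n)]
    before = [global_array[slice(*block_bounds(n, t1, r))] for r in range(t1)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "array.ckpt")
        checkpoint(path, before, n)
        after = restart(path, n, t2)
    return before, after
```

Calling `demo()` returns the sections held by the four original tasks and the sections read back by the three restarted tasks. Concatenating either list gives the same global array, which is the property that lets the task count change between checkpoint and restart. In a real system the per-task loops would run concurrently, which is where parallel streaming of array sections would matter.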