In this paper, we describe a new scheme for checkpointing parallel applications on scalable, message-passing distributed-memory systems. The novelty of our scheme is that a checkpointed application can be restored from its checkpointed state in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t1 tasks on p1 processors and then restarted from the checkpointed state with t2 tasks on p2 processors. As a result, applications can recover from partial failures in the underlying system. In addition, the reconfigurable checkpointed states can be migrated from one parallel system to another even if the two systems do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This model is derived from the DRMS programming model, which was developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distribution-independent representation of application array data structures in persistent storage. To further optimize the performance of checkpoint/restart operations, we provide parallel array section streaming operations for such distributed arrays. We present performance data for the reconfigurable checkpointing and restarting of parallel applications and compare it with the performance of conventional forms of checkpointing. Our results demonstrate the advantages of the new scheme.
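The idea of a distribution-independent array representation can be illustrated with a small, single-process sketch. It is not the DRMS interface: the function names (`block_bounds`, `checkpoint`, `restart`), the choice of a one-dimensional block distribution, and the use of an in-memory list as a stand-in for persistent storage are all assumptions made for illustration. The point it tries to show is that if array sections are stored in global index order, the stored state carries no trace of how many tasks wrote it, so it can be re-partitioned for a different task count on restart.

```python
# Illustrative sketch only: hypothetical helper names, 1-D block distribution,
# and a plain list standing in for the checkpoint file.

def block_bounds(n, tasks, rank):
    """Return the [start, stop) global index range owned by `rank`
    when n elements are block-distributed over `tasks` tasks."""
    base, extra = divmod(n, tasks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def checkpoint(local_sections, n):
    """Write each task's local section at its global offset.
    The result is ordered by global index, independent of the task count."""
    tasks = len(local_sections)
    stored = [None] * n
    for rank, section in enumerate(local_sections):
        start, stop = block_bounds(n, tasks, rank)
        stored[start:stop] = section
    return stored

def restart(stored, tasks):
    """Re-partition the stored array for a (possibly different) task count."""
    n = len(stored)
    return [stored[slice(*block_bounds(n, tasks, rank))] for rank in range(tasks)]

# A 10-element array running on t1 = 4 tasks ...
data = list(range(10))
t1_sections = restart(data, 4)
# ... is checkpointed, then restarted on t2 = 3 tasks.
saved = checkpoint(t1_sections, len(data))
t2_sections = restart(saved, 3)
print(t1_sections)
print(t2_sections)
```

In a real parallel setting each task would write and read only its own section, concurrently, at the offsets computed above; the loop over ranks here merely simulates that within one process.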