We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. Fault tolerance mechanisms can be designed at several levels of programming abstraction: high-level, where checkpoint and restart are built into our C++ objects but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where a periodic interrupt causes a memory image to be written out. Because we consider portability, both of our libraries and of the checkpoints they produce, to be an important goal, we focus on the higher-level checkpointing methods. We also describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even when failures are frequent.
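To make the high-level approach concrete, the following is a minimal C++ sketch of checkpoint and restart built into an object. It is not Dome's actual interface: the class `CheckpointedVector`, the function `run`, and the text-based checkpoint format are all assumptions invented for illustration. The only idea taken from the abstract is that the object saves its logical state rather than a raw memory image, so a checkpoint written on one architecture can be read on another.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a checkpointable data-parallel object.
// This is NOT Dome's API; every name here is an assumption.
class CheckpointedVector {
public:
    explicit CheckpointedVector(std::size_t n = 0) : data_(n, 0) {}

    long& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Write the logical state as text, so the checkpoint does not depend on
    // the byte order or word size of the machine that produced it.
    void checkpoint(std::ostream& out) const {
        out << data_.size() << '\n';
        for (long v : data_) out << v << ' ';
        out << '\n';
    }

    // Rebuild the object from a checkpoint; returns false on malformed input.
    bool restore(std::istream& in) {
        std::size_t n = 0;
        if (!(in >> n)) return false;
        std::vector<long> tmp(n);
        for (std::size_t i = 0; i < n; ++i)
            if (!(in >> tmp[i])) return false;
        data_.swap(tmp);
        return true;
    }

private:
    std::vector<long> data_;
};

// Toy computation: step k adds k to every element. State is saved into
// `saved` every `interval` steps. If `saved` is non-empty on entry, the run
// resumes from it. A non-negative `fail_at` stops the run at that step to
// imitate a crash. Returns the step reached, or -1 if the checkpoint is bad.
int run(std::string& saved, CheckpointedVector& v,
        int total_steps, int interval, int fail_at) {
    assert(interval > 0);
    int step = 0;
    if (!saved.empty()) {
        std::istringstream in(saved);
        if (!(in >> step) || !v.restore(in)) return -1;
    }
    while (step < total_steps) {
        if (step == fail_at) return step;  // simulated failure
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += step;
        ++step;
        if (step % interval == 0) {
            std::ostringstream out;
            out << step << '\n';  // loop position is part of the state
            v.checkpoint(out);
            saved = out.str();
        }
    }
    return step;
}
```

In this sketch a restarted run repeats only the steps taken since the last checkpoint, and the result matches an uninterrupted run. The cost is the constraint the abstract mentions: the program must be structured so that the loop position and the objects together capture all state needed to resume.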