In this paper we describe an mpiJava extension that implements a parallel checkpointing/recovery service. The facility is transparent to applications, i.e. no instrumentation is needed. Checkpoints are taken in a distributed fashion: each process takes its local checkpoint independently. This approach reduces communication between processes, and there is no need for a central server for checkpoint storage. We present experiments which suggest that this extended MPI functionality carries no significant performance penalty beyond the well-known cost of generating the local checkpoints.
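
To make the idea of independent local checkpoints concrete, the following is a minimal sketch in plain Java. It is not the extension described in the abstract: that service is transparent to the application, whereas this sketch saves state explicitly. The class `LocalCheckpoint`, the `ProcessState` fields, and the file naming scheme are all hypothetical and chosen only for illustration. No mpiJava calls are used; the rank is passed in as an ordinary integer. The sketch also ignores in-flight messages, so it says nothing about how a consistent global state would be recovered.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalCheckpoint {

    /** Hypothetical per-process state; a real application would hold far more. */
    static class ProcessState implements Serializable {
        private static final long serialVersionUID = 1L;
        final int rank;        // process identifier (assumed to come from the MPI runtime)
        final long iteration;  // how far the computation has progressed
        final double[] data;   // local working data

        ProcessState(int rank, long iteration, double[] data) {
            this.rank = rank;
            this.iteration = iteration;
            this.data = data;
        }
    }

    /** Each process writes to its own file, so no central checkpoint server is involved. */
    static Path checkpointFile(Path dir, int rank) {
        return dir.resolve("ckpt-rank-" + rank + ".ser");
    }

    /** Serialize the state to a temporary file, then move it over the previous checkpoint. */
    static void save(Path dir, ProcessState state) throws IOException {
        Path target = checkpointFile(dir, state.rank);
        Path tmp = dir.resolve(target.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(state);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Read back the most recent local checkpoint of the given rank. */
    static ProcessState restore(Path dir, int rank) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(checkpointFile(dir, rank)))) {
            return (ProcessState) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt");
        // Two "processes" checkpoint independently; neither waits for the other.
        save(dir, new ProcessState(0, 42L, new double[] {1.0, 2.0}));
        save(dir, new ProcessState(1, 17L, new double[] {3.0}));
        ProcessState r0 = restore(dir, 0);
        System.out.println("rank " + r0.rank + " resumed at iteration " + r0.iteration);
    }
}
```

Writing to a temporary file and then moving it into place is a design choice of this sketch: a crash during `save` leaves the previous checkpoint intact rather than a half-written file.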