DIB—a distributed implementation of backtracking
ACM Transactions on Programming Languages and Systems (TOPLAS)
Real-time, concurrent checkpoint for parallel programs
PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Efficient load balancing for wide-area divide-and-conquer applications
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
ATLAS: an infrastructure for global computing
EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
Experiments with Migration of Message-Passing Tasks
GRID '00 Proceedings of the First IEEE/ACM International Workshop on Grid Computing
The Cactus Code: A Problem Solving Environment for the Grid
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Adaptive and reliable parallel computing on networks of workstations
ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
Hi-index | 0.00 |
Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we have presented orphan saving, an efficient mechanism addressing these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, adding the capability to also tolerate the total loss of all processors, and to allow suspending and later resuming an application. Both mechanisms have only negligible overheads in the absence of faults, even with extremely short checkpointing intervals like one minute. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15 %. Also, suspending/resuming an application has only little overhead, making our approach very attractive for writing grid applications.