Persistent fault-tolerance for divide-and-conquer applications on the grid

Authors:
Gosia Wrzesinska;Ana-Maria Oprescu;Thilo Kielmann;Henri Bal
Affiliations:
Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam
Venue:
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Year:
2007

Abstract

Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we presented orphan saving, an efficient mechanism that addresses these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, which adds the ability to tolerate the total loss of all processors and to suspend and later resume an application. Both mechanisms have negligible overhead in the absence of faults, even with checkpointing intervals as short as one minute. When faults occur, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Suspending and resuming an application also incurs little overhead, making our approach very attractive for writing grid applications.
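
The idea of writing partial results to checkpoint files can be illustrated with a small single-process sketch. This is not the authors' implementation, and the abstract does not specify one: the class name `CheckpointedSolver`, the JSON file format, the `lo:hi` task keys, and the sum-of-squares workload are all assumptions chosen for illustration. The sketch records the result of each finished subtask, periodically writes the recorded results to a file, and on restart reuses any subtask result found in that file instead of recomputing it.

```python
import json
import os
import tempfile
import time


class CheckpointedSolver:
    """Toy divide-and-conquer solver that persists finished subtask results.

    Hypothetical illustration only: sums i*i over [lo, hi) by recursive halving.
    """

    def __init__(self, path, interval_s=60.0, leaf_size=100):
        self.path = path
        self.interval_s = interval_s  # minimum time between checkpoint writes
        self.leaf_size = leaf_size    # ranges this small are computed directly
        self.results = {}             # "lo:hi" -> result of that subtask
        self.executed = 0             # subtasks actually computed in this run
        self.last_write = time.monotonic()
        if os.path.exists(path):
            # Resume: load partial results left behind by an earlier run.
            with open(path) as f:
                self.results = json.load(f)

    def write_checkpoint(self):
        # Write to a temporary file first, then rename it over the old
        # checkpoint, so a crash mid-write never leaves a truncated file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        self.last_write = time.monotonic()

    def solve(self, lo, hi):
        key = f"{lo}:{hi}"
        if key in self.results:
            # Result recovered from a checkpoint (or computed earlier).
            return self.results[key]
        self.executed += 1
        if hi - lo <= self.leaf_size:
            value = sum(i * i for i in range(lo, hi))
        else:
            mid = (lo + hi) // 2
            value = self.solve(lo, mid) + self.solve(mid, hi)
        self.results[key] = value
        if time.monotonic() - self.last_write >= self.interval_s:
            self.write_checkpoint()
        return value
```

If the process is stopped and a new `CheckpointedSolver` is created with the same `path`, only the subtasks missing from the file are recomputed. A distributed runtime would additionally have to decide which processor writes the file and how results from many workers are gathered, which this sketch does not attempt.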