Persistent fault-tolerance for divide-and-conquer applications on the grid

  • Authors:
  • Gosia Wrzesinska;Ana-Maria Oprescu;Thilo Kielmann;Henri Bal

  • Affiliations:
  • Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam;Vrije Universiteit Amsterdam

  • Venue:
  • Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we have presented orphan saving, an efficient mechanism addressing these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, adding the capability to also tolerate the total loss of all processors, and to allow suspending and later resuming an application. Both mechanisms have only negligible overheads in the absence of faults, even with extremely short checkpointing intervals like one minute. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15 %. Also, suspending/resuming an application has only little overhead, making our approach very attractive for writing grid applications.