Fault-tolerant parallel applications with dynamic parallel schedules: a programmer's perspective

  • Authors:
  • Sebastian Gerlach; Basile Schaeli; Roger D. Hersch

  • Affiliations:
  • Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Ecublens, Switzerland (all three authors)

  • Venue:
  • Dependable Systems
  • Year:
  • 2006

Abstract

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications in case of node failures. The fault-tolerance mechanism relies on a set of backup threads held in the volatile storage of alternate nodes. These backup threads are kept up to date both by duplicating transmitted data objects and by periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application's source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer's perspective.
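
The recovery scheme described in the abstract (duplicated data objects plus periodic checkpoints, followed by re-execution from the last checkpoint) can be illustrated with a minimal sketch. This is not the DPS API: the names `Backup`, `Worker`, `running_sum`, and `checkpoint_every` are hypothetical, the "nodes" are plain in-memory objects, and data objects are replayed in arrival order, whereas DPS deduces a valid execution order from the flow graph. The sketch also assumes the operation is deterministic, so that replaying the same objects reproduces the same state.

```python
import copy


class Backup:
    """Hypothetical backup thread kept in the memory of an alternate node."""

    def __init__(self, initial_state):
        self.checkpoint = copy.deepcopy(initial_state)
        self.log = []  # data objects duplicated since the last checkpoint

    def duplicate(self, data_object):
        # Keep a copy of every data object sent to the active thread.
        self.log.append(copy.deepcopy(data_object))

    def store_checkpoint(self, state):
        # A new checkpoint makes the earlier duplicated objects unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self, operation):
        # Rebuild the lost state by re-executing from the last checkpoint.
        state = copy.deepcopy(self.checkpoint)
        for data_object in self.log:
            state = operation(state, data_object)
        return state


class Worker:
    """Hypothetical active thread that processes incoming data objects."""

    def __init__(self, operation, initial_state, backup, checkpoint_every):
        self.operation = operation
        self.state = initial_state
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.processed = 0

    def receive(self, data_object):
        # Duplicate first, so the object survives a failure during processing.
        self.backup.duplicate(data_object)
        self.state = self.operation(self.state, data_object)
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.backup.store_checkpoint(self.state)


def running_sum(state, value):
    """Deterministic example operation: accumulate the received values."""
    return state + value


backup = Backup(0)
worker = Worker(running_sum, 0, backup, checkpoint_every=3)
for value in [1, 2, 3, 4, 5]:
    worker.receive(value)

# Simulate a failure of the worker's node: only the backup remains.
recovered = backup.recover(running_sum)
```

In this run the checkpoint is taken after the third object, so recovery starts from that checkpoint and replays only the two objects received afterwards. The checkpoint interval trades checkpointing cost during normal execution against the amount of re-execution needed after a failure.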