Checkpointing BSP parallel applications on the InteGrade Grid middleware

  • Authors:
  • Raphael Y. de Camargo; Andrei Goldchleger; Fabio Kon; Alfredo Goldman

  • Affiliations:
  • Department of Computer Science, University of São Paulo, Rua do Matão, 1010, 05508-090 São Paulo, SP, Brazil (all authors)

  • Venue:
  • Concurrency and Computation: Practice & Experience - Middleware for Grid Computing
  • Year:
  • 2006

Abstract

InteGrade is a Grid middleware infrastructure that harnesses the idle computing power of user workstations. One of its goals is to support the execution of long-running parallel applications that involve a considerable amount of communication among application nodes. However, in an environment composed of shared user workstations spread across many different LANs, machines may fail, become inaccessible, or switch from idle to busy very rapidly, compromising the execution of the parallel application on some of its nodes. Providing a fault-tolerance mechanism therefore becomes a major requirement for such a system. In this paper, we describe the support for checkpoint-based rollback recovery of Bulk Synchronous Parallel (BSP) applications running over the InteGrade middleware. This mechanism periodically saves the application state so that, in case of failure, the application can restart from an intermediate execution point. A precompiler automatically instruments the source code of a C/C++ application, adding code for saving and recovering application state. A failure detector monitors the application execution. In case of failure, the application is restarted from the last saved global checkpoint. Copyright © 2005 John Wiley & Sons, Ltd.
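
The checkpoint-and-restart cycle described in the abstract can be illustrated with a minimal sketch. It is not the InteGrade implementation: the paper instruments C/C++ code with a precompiler, whereas this toy uses Python, a single process, and a trivial running sum as the "application state". The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical and exist only for illustration. The sketch assumes a checkpoint is taken right after the barrier that ends a superstep, since that is a point where the state of a BSP application is consistent.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, superstep, state):
    """Write (superstep, state) to a temp file, then rename it over `path`.

    The rename means a crash during the write cannot leave a partial
    checkpoint file behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump({"superstep": superstep, "state": state}, tmp)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (superstep, state), or (0, None) if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["superstep"], data["state"]


def run(path, total_supersteps, checkpoint_every, fail_at=None):
    """Toy superstep loop: each superstep adds its index to a running sum.

    `fail_at` (hypothetical) simulates a node failure at the start of
    the given superstep.
    """
    # Recovery: resume from the last saved checkpoint, if there is one.
    superstep, state = load_checkpoint(path)
    if state is None:
        state = {"total": 0}

    while superstep < total_supersteps:
        if fail_at is not None and superstep == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += superstep  # local computation of this superstep
        superstep += 1               # stands in for the end-of-superstep barrier
        if superstep % checkpoint_every == 0:
            save_checkpoint(path, superstep, state)
    return state["total"]
```

For example, calling `run(path, 10, 3, fail_at=7)` raises after a checkpoint has been saved at superstep 6; calling `run(path, 10, 3)` again with the same `path` resumes from that checkpoint and redoes only the supersteps after it. In a real multi-node setting, each process would save its own state and the set of per-process checkpoints taken at the same superstep would form the global checkpoint.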