Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware

Authors:
Raphael Y. de Camargo;Andrei Goldchleger;Fabio Kon;Alfredo Goldman
Affiliations:
University of São Paulo, São Paulo, SP, Brazil;University of São Paulo, São Paulo, SP, Brazil;University of São Paulo, São Paulo, SP, Brazil;University of São Paulo, São Paulo, SP, Brazil
Venue:
MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Year:
2004

Abstract

InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user workstations. One of its goals is to support the execution of long-running parallel applications that involve a considerable amount of communication among application nodes. However, in an environment composed of shared user workstations spread across many different LANs, machines may fail, become inaccessible, or switch from idle to busy very rapidly, compromising the execution of the parallel application on some of its nodes. Providing a fault-tolerance mechanism therefore becomes a major requirement for such a system. In this paper, we describe the support for checkpoint-based rollback recovery of parallel BSP applications running over the InteGrade middleware. This mechanism consists of periodically saving the application state so that execution can be restarted from an intermediate point in case of failure. A precompiler automatically instruments the source code of a C/C++ application, adding code for saving and recovering application state. A failure detector monitors the application execution. In case of failure, the application is restarted from the last saved global checkpoint.
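
The save-and-restart cycle described in the abstract can be illustrated with a minimal, single-process sketch. It is not the InteGrade API or the code the precompiler generates: the `State` struct, the functions `save_checkpoint`, `load_checkpoint`, and `run`, the text file format, and the simulated failure are all hypothetical names and choices made for illustration. The sketch assumes checkpoints are taken at superstep boundaries, which in the BSP model are synchronization points, and it assumes POSIX `rename` semantics (the new file replaces the old one).

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical application state: everything needed to resume the loop.
struct State {
    int superstep = 0;   // index of the next superstep to execute
    long long sum = 0;   // partial result accumulated so far
};

// Write the state to a temporary file, then rename it over the previous
// checkpoint, so a crash during the write leaves the last good one intact.
// Assumption: POSIX rename semantics (an existing target is replaced).
bool save_checkpoint(const State& s, const std::string& path) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        if (!out) return false;
        out << s.superstep << ' ' << s.sum << '\n';
        if (!out.flush()) return false;
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Load the last checkpoint; leaves `s` untouched if none can be read.
bool load_checkpoint(State& s, const std::string& path) {
    std::ifstream in(path);
    State loaded;
    if (!(in >> loaded.superstep >> loaded.sum)) return false;
    s = loaded;
    return true;
}

// Run supersteps up to `total`, resuming from the checkpoint if one exists.
// Superstep i adds i to the running sum (a stand-in for real work).
// A checkpoint is saved every `interval` supersteps. If fail_at >= 0, a
// failure is simulated by returning false just before that superstep.
bool run(int total, int interval, int fail_at,
         const std::string& path, long long& result) {
    assert(interval > 0);
    State s;
    load_checkpoint(s, path);  // no checkpoint: start from superstep 0
    while (s.superstep < total) {
        if (s.superstep == fail_at) return false;  // simulated failure
        s.sum += s.superstep;
        ++s.superstep;
        if (s.superstep % interval == 0 && !save_checkpoint(s, path))
            return false;
    }
    result = s.sum;
    return true;
}
```

Calling `run(10, 3, 7, path, r)` with no checkpoint file present stops at the simulated failure. A second call, `run(10, 3, -1, path, r)`, resumes from the checkpoint taken after superstep 5 and re-executes superstep 6, which had completed but was not yet checkpointed. Recomputing work done since the last checkpoint is the usual cost of rollback recovery. A parallel application would additionally need the saved states of all processes to form a consistent global checkpoint, which this single-process sketch does not attempt.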