InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user workstations. One of its goals is to support the execution of long-running parallel applications that involve a considerable amount of communication among application nodes. However, in an environment composed of shared user workstations spread across many different LANs, machines may fail, become inaccessible, or switch from idle to busy very rapidly, compromising the execution of the parallel application on some of its nodes. Providing a fault-tolerance mechanism therefore becomes a major requirement for such a system. In this paper, we describe the support for checkpoint-based rollback recovery of parallel BSP applications running over the InteGrade middleware. This mechanism periodically saves the application state so that, in case of failure, execution can be restarted from an intermediate point rather than from the beginning. A precompiler automatically instruments the source code of a C/C++ application, adding code for saving and recovering application state. A failure detector monitors the application execution. In case of failure, the application is restarted from the last saved global checkpoint.
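
The general idea of checkpoint-based rollback recovery can be illustrated with a minimal single-process sketch. It is not InteGrade's API and not the code the precompiler generates: it is written in Python for brevity rather than C/C++, and the names `save_checkpoint`, `load_checkpoint`, `run`, and `fail_at` are hypothetical. It assumes state is saved only at superstep boundaries, where a BSP computation is in a consistent state, and it simulates a failure by raising an exception.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Save state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_supersteps, checkpoint_every, fail_at=None):
    """Run a toy computation in supersteps, resuming from a checkpoint if present.

    fail_at (hypothetical) simulates a node failure by raising an exception
    just before that superstep would execute.
    """
    state = load_checkpoint(path) or {"superstep": 0, "acc": 0}
    while state["superstep"] < total_supersteps:
        if fail_at is not None and state["superstep"] == fail_at:
            raise RuntimeError("simulated node failure")
        # Local computation phase of the superstep (placeholder workload).
        state["acc"] += state["superstep"]
        state["superstep"] += 1
        # Checkpoint at the superstep boundary, every `checkpoint_every` steps.
        if state["superstep"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved superstep instead of from superstep 0. In a real parallel setting, every process would save its state at the same superstep so that the set of local checkpoints forms a consistent global checkpoint.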