Fault Tolerance for Cluster Computing Based on Functional Tasks
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Extending PVM with Consistent Cut Capabilities: Application Aspects and Implementation Strategies
Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Algorithm-based fault tolerance applied to high performance computing
Journal of Parallel and Distributed Computing
Abstract: This paper presents CPVM, a library that supports the implementation of non-blocking, global checkpoint-restart algorithms for applications written with PVM, thereby providing fault tolerance. A salient feature of CPVM is that it builds several advanced facilities, useful for solving different problems, on a small set of new PVM primitives. CPVM can also serve as a platform for implementing algorithms that detect stable properties such as deadlock and termination, and for supporting job swapping and migration in an environment that previously lacked such support.
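
The abstract does not list CPVM's primitives, so the following is only a generic illustration of what a non-blocking global checkpoint (a consistent cut) involves. It sketches the well-known marker-based snapshot of Chandy and Lamport over FIFO channels, which is not necessarily the algorithm CPVM uses. Every name here (`Process`, `Network`, `run_demo`) is hypothetical and none of it is part of PVM or CPVM.

```python
from collections import deque

MARKER = "MARKER"  # control message that delimits the cut on a channel


class Network:
    """FIFO channels between processes (an assumption of the algorithm)."""

    def __init__(self):
        self.channels = {}  # (src, dst) -> deque of messages in flight

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver(self, src, dst, procs):
        msg = self.channels[(src, dst)].popleft()
        procs[dst].receive(src, msg, self)


class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance        # live application state
        self.peers = peers            # ids of the other processes
        self.recorded_state = None    # local checkpoint, once taken
        self.channel_log = {}         # peer id -> messages caught in flight
        self.recording = set()        # incoming channels still being logged

    def start_snapshot(self, network):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.channel_log = {p: [] for p in self.peers}
        self.recording = set(self.peers)
        for p in self.peers:
            network.send(self.pid, p, MARKER)

    def transfer(self, dst, amount, network):
        # Ordinary application message; never blocked by the snapshot.
        self.balance -= amount
        network.send(self.pid, dst, amount)

    def receive(self, src, msg, network):
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot(network)
            self.recording.discard(src)  # channel from src is now closed
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_log[src].append(msg)


def run_demo():
    net = Network()
    procs = {0: Process(0, 100, [1]), 1: Process(1, 50, [0])}
    procs[1].transfer(0, 20, net)   # 20 units in flight before the cut
    procs[0].start_snapshot(net)    # process 0 initiates the checkpoint
    procs[0].transfer(1, 10, net)   # sent after the marker: outside the cut
    net.deliver(1, 0, procs)        # in-flight 20 is logged as channel state
    net.deliver(0, 1, procs)        # marker reaches process 1
    net.deliver(0, 1, procs)        # the post-cut transfer of 10
    net.deliver(1, 0, procs)        # process 1's marker closes the channel
    recorded = sum(p.recorded_state for p in procs.values())
    recorded += sum(sum(log) for p in procs.values()
                    for log in p.channel_log.values())
    return recorded, procs
```

In this toy run the recorded local states plus the logged in-flight messages account for the same total as the live system, which is the property that makes the cut usable as a restart point. Application messages keep flowing while the markers propagate, which is what "non-blocking" means in this sketch.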