Migol: A fault-tolerant service framework for MPI applications in the grid
Future Generation Computer Systems
In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications. The benefit of the Grid is that, in case of a failure, an application may be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.