For the sciences in particular, massively parallel CPU capacity is one of the most attractive features of a grid. A major challenge in a distributed, inherently dynamic grid is fault tolerance: the more resources and components are involved, the more complicated and error-prone the system becomes. In a grid with potentially thousands of interconnected machines, the reliability of individual resources cannot be guaranteed. The benefit of the grid is that, after a failure, an application can be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently. In this article, we present Migol, a fault-tolerant and self-healing grid middleware for MPI applications. Migol is based on open standards and extends the services of the Globus Toolkit to support fault tolerance for grid applications. The Migol framework itself is also designed with a special focus on fault tolerance: for example, Migol replicates critical services and uses a ring-based replication protocol to achieve data consistency.