Writing applications that execute efficiently on Grids is extremely difficult and tedious for inexperienced users. The distributed resources are typically heterogeneous and non-dedicated, and are offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to the dynamic characteristics of the Grid are therefore essential. This paper describes the strategy used to bestow the self-healing property on autonomic EasyGrid MPI applications so that they can withstand process and resource failures. It highlights both the difficulties and the low-cost solution adopted to offer fault tolerance in applications based on the standard Grid installation of LAM/MPI.