Writing applications that execute efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees, so systems capable of adapting the execution of an application to these characteristics are essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy to endow these autonomic MPI applications with the property of self-healing, making them capable of withstanding multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution may facilitate the acceptance of fault tolerance techniques in large-scale high-performance applications.