FT-MPI, fault-tolerant metacomputing and generic name services: a case study
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
There is growing interest in deploying MPI over multiple heterogeneous, geographically distributed resources to perform very large-scale computations. However, increasing the number of resources and their geographical spread creates problems with interoperability and fault tolerance. FT-MPI presents an interesting solution for adding fault tolerance to MPI, but it suffers from interoperability limitations and potential single points of failure when crossing multiple administrative domains. We propose to overcome these limitations by adding “pluggability” for one potential single point of failure, the name service used by FT-MPI, and by combining FT-MPI with the H2O metacomputing framework.