We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient Message Passing Interface (MPI) programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI messages across heterogeneous aggregates located in different administrative or network domains. MPI communication is mediated by a specially written H2O pluglet: messages destined for remote sites are intercepted and transparently forwarded to their final destinations. We demonstrate that the proposed technique is effective in enabling MPI programs to communicate across distinct clusters and across firewalls. Our tests show only a marginal performance penalty, and we believe the substantially increased functionality compensates for this overhead in most situations. Beyond enabling multicluster communication, we note that as metacomputing environments grow in size and geographic spread, fault tolerance becomes critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We describe extensions that overcome this limitation by combining FT-MPI with the H2O framework. Our holistic approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement from resource providers.
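
To make the forwarding idea concrete, the sketch below shows, in plain C, a store-and-forward relay of the kind the pluglet implements: it accepts a local connection and transparently copies the byte stream to a destination on the remote cluster. This is an illustrative reconstruction, not the actual H2O pluglet (which runs inside the H2O framework and is written against its APIs); the hostname and port numbers are hypothetical.

/* Minimal store-and-forward relay sketch (not the H2O pluglet itself).
 * It accepts one local connection and transparently copies all bytes to
 * a destination on the remote cluster, the way the pluglet forwards
 * intercepted MPI traffic across domain boundaries. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP connection to host:port, or return -1 on failure. */
static int connect_to(const char *host, const char *port) {
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0) return -1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

int main(void) {
    /* Listen on a local port; in the real system the pluglet intercepts
     * MPI messages inside H2O rather than via a raw listening socket. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) { perror("socket"); return 1; }
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5555);                 /* hypothetical local port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(lfd, 1) != 0) { perror("listen"); return 1; }

    int in = accept(lfd, NULL, NULL);
    /* Forward to a gateway on the remote cluster (hypothetical address). */
    int out = connect_to("gateway.remote.example", "5556");
    if (in < 0 || out < 0) { perror("connect"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                        /* handle partial writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w <= 0) { perror("write"); return 1; }
            off += w;
        }
    }
    close(in); close(out); close(lfd);
    return 0;
}

A relay of this sort would typically run on a gateway node reachable from outside the firewall, which is what allows traffic to cross clusters whose internal nodes sit on otherwise non-routable networks.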
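
On the fault tolerance side, the following minimal sketch uses only standard MPI calls to show the detect-and-react pattern that the FT-MPI model builds on: the application installs an error handler that returns failure codes instead of aborting, checks communication results, and decides how to proceed. FT-MPI's actual recovery modes (for example, rebuilding or shrinking the communicator after a process failure) go beyond what standard MPI offers, so this illustrates the programming style, not the FT-MPI API.

/* Detect-and-react sketch using only standard MPI calls; FT-MPI's own
 * recovery (communicator rebuild/shrink) is not reproduced here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the job, so the
     * program gets a chance to run its own recovery logic. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int token = rank;
    int rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
        /* Under FT-MPI the surviving processes could now rebuild or
         * shrink the communicator and continue; with plain MPI the only
         * portable option is to clean up and stop. */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Finalize();
    return 0;
}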