We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient Message Passing Interface (MPI) programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI messages across heterogeneous aggregates located in different administrative or network domains. MPI communication is mediated by a specially written H2O pluglet: messages destined for remote sites are intercepted and transparently forwarded to their final destinations. We demonstrate that the proposed technique is effective in enabling MPI programs to communicate across distinct clusters and across firewalls. Our tests show only a marginal performance penalty, and we believe the substantially increased functionality compensates for this overhead in most situations. Beyond enabling multicluster communication, we note that as metacomputing environments grow in size and geographic spread, fault tolerance becomes critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We describe extensions that overcome this limitation by combining FT-MPI with the H2O framework. Our holistic approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement from resource providers.
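
To make the forwarding idea concrete, the sketch below shows, in plain C, a store-and-forward relay of the kind the pluglet implements: it accepts a local connection and transparently copies the byte stream to a destination on the remote cluster. This is an illustrative reconstruction, not the actual H2O pluglet (which runs inside the H2O framework and is written against its APIs); the hostname and port numbers are hypothetical.

/* Minimal store-and-forward relay sketch (not the H2O pluglet itself).
 * It accepts one local connection and transparently copies all bytes to
 * a destination on the remote cluster, the way the pluglet forwards
 * intercepted MPI traffic across domain boundaries. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP connection to host:port, or return -1 on failure. */
static int connect_to(const char *host, const char *port) {
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0) return -1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

int main(void) {
    /* Listen on a local port; in the real system the pluglet intercepts
     * MPI messages inside H2O rather than via a raw listening socket. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) { perror("socket"); return 1; }
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5555);                 /* hypothetical local port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(lfd, 1) != 0) { perror("listen"); return 1; }

    int in = accept(lfd, NULL, NULL);
    /* Forward to a gateway on the remote cluster (hypothetical address). */
    int out = connect_to("gateway.remote.example", "5556");
    if (in < 0 || out < 0) { perror("connect"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                        /* handle partial writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w <= 0) { perror("write"); return 1; }
            off += w;
        }
    }
    close(in); close(out); close(lfd);
    return 0;
}

A relay of this sort would typically run on a gateway node reachable from outside the firewall, which is what allows traffic to cross clusters whose internal nodes sit on otherwise non-routable networks.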
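
On the fault tolerance side, the following minimal sketch uses only standard MPI calls to show the detect-and-react pattern that the FT-MPI model builds on: the application installs an error handler that returns failure codes instead of aborting, checks communication results, and decides how to proceed. FT-MPI's actual recovery modes (for example, rebuilding or shrinking the communicator after a process failure) go beyond what standard MPI offers, so this illustrates the programming style, not the FT-MPI API.

/* Detect-and-react sketch using only standard MPI calls; FT-MPI's own
 * recovery (communicator rebuild/shrink) is not reproduced here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the job, so the
     * program gets a chance to run its own recovery logic. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int token = rank;
    int rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
        /* Under FT-MPI the surviving processes could now rebuild or
         * shrink the communicator and continue; with plain MPI the only
         * portable option is to clean up and stop. */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Finalize();
    return 0;
}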