Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters

  • Authors:
  • Dawid Kurzyniec; Vaidy Sunderam

  • Affiliations:
  • Department of Math and Computer Science, Emory University, Atlanta, GA 30322, USA (both authors)

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2005

Abstract

We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient Message Passing Interface (MPI) programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI communication is aided by a specially written H2O pluglet: messages destined for remote sites are intercepted and transparently forwarded to their final destinations. We demonstrate that the proposed technique is effective in enabling communication by MPI programs across distinct clusters and across firewalls. Our tests showed only marginally lower performance, and we believe the substantially increased functionality would compensate for this overhead in most situations. In addition to enabling multicluster communication, we note that as metacomputing environments grow in size and distribution, fault tolerance becomes critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We describe extensions that overcome this limitation by combining FT-MPI with the H2O framework. Our holistic approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.