Combining FT-MPI with H2O: Fault-Tolerant MPI Across Administrative Boundaries

  • Authors:
  • Dawid Kurzyniec; Vaidy Sunderam

  • Affiliations:
  • Emory University, Atlanta, GA; Emory University, Atlanta, GA

  • Venue:
  • IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
  • Year:
  • 2005

Abstract

We observe increasing interest in aggregating geographically distributed, heterogeneous resources to perform large-scale computations. MPI remains the most popular programming paradigm for such applications; however, as the size of computing environments increases, fault tolerance becomes critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We propose to overcome this limitation by combining FT-MPI with the H2O resource sharing framework. Our approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.