Reliable orchestration of distributed MPI-Applications in a UNICORE-Based grid with MetaMPICH and metascheduling

  • Authors:
  • Boris Bierbaum; Carsten Clauss; Thomas Eickermann; Lidia Kirtchakova; Arnold Krechel; Stephan Springstubbe; Oliver Wäldrich; Wolfgang Ziegler

  • Affiliations:
  • Chair for Operating Systems, RWTH Aachen University, Aachen, Germany; Chair for Operating Systems, RWTH Aachen University, Aachen, Germany; Central Institute for Applied Mathematics, Research Centre Jülich, Jülich, Germany; Central Institute for Applied Mathematics, Research Centre Jülich, Jülich, Germany; Fraunhofer Institute SCAI, Sankt Augustin, Germany; Fraunhofer Institute SCAI, Sankt Augustin, Germany; Fraunhofer Institute SCAI, Sankt Augustin, Germany; Fraunhofer Institute SCAI, Sankt Augustin, Germany

  • Venue:
  • EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
  • Year:
  • 2006

Abstract

Large MPI applications whose resource demands exceed the capacity of the local site's cluster could instead be distributed across a number of clusters in a Grid. However, several drawbacks limit the applicability of this approach: communication paths between compute nodes of different clusters usually provide lower bandwidth and higher latency than cluster-internal ones; MPI libraries use dedicated I/O nodes for inter-cluster communication, which become a bottleneck; and tools for coordinating the availability of the different clusters across administrative domains are missing. To make the Grid approach efficient, several prerequisites must be in place: an implementation of MPI providing high-performance communication mechanisms across cluster boundaries, a network connection with high bandwidth and low latency dedicated to the application, compute nodes made available to the application exclusively, and finally a Grid middleware gluing everything together. In this paper we present work recently completed in the VIOLA project: MetaMPICH, user-controlled QoS of clusters and the interconnecting network, a MetaScheduling Service, and the UNICORE integration.