Optimizing MPI collective communication by orthogonal structures

  • Authors:
  • Matthias Kühnemann; Thomas Rauber; Gudula Rünger

  • Affiliations:
  • Matthias Kühnemann: Fakultät für Informatik, Technische Universität Chemnitz, 09107 Chemnitz, Germany
  • Thomas Rauber: Fakultät für Mathematik und Physik, Universität Bayreuth, 95445 Bayreuth, Germany
  • Gudula Rünger: Fakultät für Informatik, Technische Universität Chemnitz, 09107 Chemnitz, Germany

  • Venue:
  • Cluster Computing
  • Year:
  • 2006


Abstract

MPI collective communication operations for distributing or gathering data are used in many parallel applications from scientific computing, but they can cause scalability problems because their execution times grow with the number of participating processors. In this article, we show how the execution time of collective communication operations can be reduced significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70%, respectively, on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. A significant performance improvement can also be obtained on a Cray T3E by a careful selection of the processor structure. Using these optimized communication operations can reduce the execution time of data-parallel implementations of complex application programs significantly, without requiring any other change to the computation and communication structure. We present runtime functions for modeling two-phase realizations and verify that these runtime functions can predict the execution time of the communication operations both in isolation and in the context of application programs.
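
The abstract describes the technique only in prose; as a concrete illustration, the following is a minimal sketch of how a two-phase MPI_Bcast() over an orthogonal p1 x p2 processor grid could be realized with MPI_Comm_split(). The function name orthogonal_bcast, the row-major grid mapping, and the choice of rank 0 as root are illustrative assumptions, not the authors' implementation.

    /* Hypothetical sketch, not the authors' code: broadcast over an
     * orthogonal p1 x p2 processor grid in two phases.  Assumes the
     * root is rank 0 of comm and the number of processes in comm is
     * divisible by p2. */
    #include <mpi.h>
    #include <stdio.h>

    static void orthogonal_bcast(void *buf, int count, MPI_Datatype type,
                                 int p2, MPI_Comm comm)
    {
        int rank, row, col;
        MPI_Comm row_comm, col_comm;

        MPI_Comm_rank(comm, &rank);
        row = rank / p2;                 /* grid coordinates (row-major) */
        col = rank % p2;

        MPI_Comm_split(comm, row, col, &row_comm);  /* one comm per row    */
        MPI_Comm_split(comm, col, row, &col_comm);  /* one comm per column */

        if (row == 0)                    /* phase 1: along the root's row */
            MPI_Bcast(buf, count, type, 0, row_comm);
        MPI_Bcast(buf, count, type, 0, col_comm);   /* phase 2: down each column */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }

    int main(int argc, char **argv)
    {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            x = 42;                      /* data to distribute */
        orthogonal_bcast(&x, 1, MPI_INT, 4, MPI_COMM_WORLD);  /* p2 = 4 columns */
        printf("rank %d: x = %d\n", rank, x);
        MPI_Finalize();
        return 0;
    }

Such a decomposition also suggests the general shape of runtime functions for two-phase realizations: under the common textbook approximation that a tree-based broadcast among q processes with message size m costs roughly T_bcast(q, m) = ceil(log2 q) * (tau + m * beta), with startup time tau and per-byte time beta, the two-phase variant costs about T_bcast(p2, m) + T_bcast(p1, m) instead of T_bcast(p1 * p2, m), and the grid dimensions p1 and p2 can be chosen to minimize this sum on a given platform. This model form is a standard approximation, not necessarily the exact runtime functions derived in the article.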