A scalable MPI_Comm_split algorithm for exascale computing

  • Authors:
  • Paul Sack; William Gropp

  • Affiliations:
  • University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign

  • Venue:
  • EuroMPI'10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface
  • Year:
  • 2010

Abstract

Existing algorithms for creating communicators in MPI programs will not scale well to future exascale supercomputers containing millions of cores. In this work, we present a novel communicator-creation algorithm that scales well to millions of processes using three techniques: replacing the sort at the end of MPI_Comm_split with a merge performed as the color-and-key table is built, sorting the color-and-key table in parallel, and storing the output communicator data in a distributed table rather than a replicated one. This reduces the time cost of MPI_Comm_split in the worst case we consider from 22 seconds to 0.37 seconds. Existing algorithms build a table with as many entries as there are processes, consuming vast amounts of memory. Our algorithm uses a small, fixed amount of memory per communicator once MPI_Comm_split has finished, and during the execution of MPI_Comm_split it uses only a fraction of the temporary memory required by the conventional algorithm.
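
For context, the sketch below is a minimal, illustrative use of the MPI_Comm_split interface whose internal algorithm the paper redesigns; the color and key choices are arbitrary examples and are not taken from the paper.

    /* Minimal usage sketch of MPI_Comm_split: split MPI_COMM_WORLD into two
       communicators by rank parity; within each, processes are ordered by key. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int color = rank % 2;   /* which new communicator this process joins */
        int key   = rank;       /* ordering of ranks within the new communicator */

        MPI_Comm newcomm;
        MPI_Comm_split(MPI_COMM_WORLD, color, key, &newcomm);

        int newrank, newsize;
        MPI_Comm_rank(newcomm, &newrank);
        MPI_Comm_size(newcomm, &newsize);
        printf("world rank %d of %d -> color %d, new rank %d of %d\n",
               rank, size, color, newrank, newsize);

        MPI_Comm_free(&newcomm);
        MPI_Finalize();
        return 0;
    }

In the conventional implementation, every process gathers the full (color, key, rank) table and sorts it locally; the paper's contribution is to merge and sort that table in parallel and to keep the resulting communicator data distributed across processes rather than replicated on each one.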