Runtime MPI collective checking with tree-based overlay networks

  • Authors:
  • Tobias Hilbrich; Bronis R. de Supinski; Fabian Hänsel; Matthias S. Müller; Martin Schulz; Wolfgang E. Nagel

  • Affiliations:
  • Technische Universität Dresden, Dresden, Germany; Lawrence Livermore National Laboratory, Livermore, CA; Technische Universität Dresden, Dresden, Germany; RWTH Aachen University, Aachen, Germany; Lawrence Livermore National Laboratory, Livermore, CA; Technische Universität Dresden, Dresden, Germany

  • Venue:
  • Proceedings of the 20th European MPI Users' Group Meeting
  • Year:
  • 2013

Abstract

Runtime error detection tools detect many classes of MPI usage errors, including errors in collective communication calls. However, they often face scalability challenges. We present runtime checks for MPI collective operations that use a Tree-Based Overlay Network (TBON) for scalability and that provide full datatype matching. While we can use transitive correctness properties for most checks, some collective operations, e.g., MPI_Alltoallv, impose non-transitive correctness properties; for these we use intralayer communication within the TBON to distribute datatype matching information. An overhead study with stress tests and two benchmark suites demonstrates applicability and scalability at 4,096, 2,048, and 16,384 processes, respectively.
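
The distinction between transitive and non-transitive checks can be illustrated with a small simulation. The sketch below is not the authors' implementation and does not use MPI; it only models the idea. All names (`CallRecord`, `reduce_at_node`, `check_transitive`, `check_alltoallv`) and the representation of a type signature as a tuple of basic type names are assumptions made for illustration. For a transitive property (every rank calls the same collective with the same root and type signature), each tree node only needs to compare its children and forward one representative. For MPI_Alltoallv, the MPI standard requires what rank i sends to rank j to match what rank j receives from rank i, which is a pairwise property that a single representative cannot capture.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumption: a type signature is modeled as a flat tuple of basic type names.
Signature = Tuple[str, ...]


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical, simplified record of one rank's collective call."""
    rank: int
    op: str            # e.g. "MPI_Bcast"
    root: int
    signature: Signature


def _key(rec: CallRecord) -> Tuple[str, int, Signature]:
    # The fields that must agree across all ranks for a transitive check.
    return (rec.op, rec.root, rec.signature)


def reduce_at_node(group: Sequence[CallRecord]):
    """Compare all records at one tree node against the first one.

    Returns the representative to forward upward and a list of
    (representative_rank, mismatching_rank) pairs.
    """
    rep = group[0]
    errors = [(rep.rank, other.rank)
              for other in group[1:] if _key(other) != _key(rep)]
    return rep, errors


def check_transitive(records: Sequence[CallRecord],
                     fan_in: int = 2) -> List[Tuple[int, int]]:
    """Simulate a tree reduction: each layer forwards one record per node."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    errors: List[Tuple[int, int]] = []
    layer = list(records)
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer), fan_in):
            rep, errs = reduce_at_node(layer[i:i + fan_in])
            next_layer.append(rep)
            errors.extend(errs)
        layer = next_layer
    return errors


def check_alltoallv(send_sigs: Sequence[Sequence[Signature]],
                    recv_sigs: Sequence[Sequence[Signature]]
                    ) -> List[Tuple[int, int]]:
    """Pairwise (non-transitive) check.

    send_sigs[i][j] is what rank i sends to rank j; recv_sigs[j][i] is what
    rank j expects from rank i. Returns the (sender, receiver) pairs whose
    signatures differ.
    """
    n = len(send_sigs)
    return [(i, j) for i in range(n) for j in range(n)
            if send_sigs[i][j] != recv_sigs[j][i]]
```

In this model, the transitive check moves one record per tree edge regardless of the number of ranks, whereas the pairwise check needs information about every sender and receiver combination. That difference is the reason a separate exchange of matching information is needed for operations such as MPI_Alltoallv.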