Improving communication-phase completion times in HPC clusters through congestion mitigation

  • Authors:
  • Yitzhak Birk;Vladimir Zdornov

  • Affiliations:
  • Israel Institute of Technology, Haifa, Israel;Israel Institute of Technology, Haifa, Israel

  • Venue:
  • SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
  • Year:
  • 2009

Abstract

Congestion arises in cluster-based supercomputers due to contention for links, spreads due to oversubscription of communication resources, and reduces performance. We mitigate it using efficient, scalable adaptive routing and explicit rate calculation. We use virtual circuits for in-order packet delivery; path setup is performed by switches locally with no blocking or backtracking. For random permutations in a slightly enriched fat-tree topology, maximum contention is reduced by up to 50% relative to static routing, but only rate control can translate this into actual gain. Unfortunately, TCP's window-based rate control fails because of the low bandwidth-delay product; moreover, small buffers cause congestion spreading even with a single-packet window. InfiniBand's CCA employs multiple parameters, which must apparently be tuned per topology and traffic pattern. Focusing on phase-based applications, we present a distributed explicit rate-assignment algorithm that minimizes the completion time of the communication phase (min-max flow completion). We also present a generally applicable packet-injection scheme for a source with different-rate flows, which realizes the desired rates even with very small switch buffers. Simulations show that adaptive routing alone is ineffective and that rate control alone is of limited effectiveness, yet together they shorten the communication phase by tens of percent. Finally, our explicit rate-calculation algorithm is faster than current reactive schemes.
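To make the min-max flow completion objective concrete, the following is a minimal centralized sketch, not the distributed algorithm of the paper, whose details the abstract does not give. It assumes each flow has a known size and a fixed path, and that each link has a known capacity; the function name, the link names, and the example values are all hypothetical. Under these assumptions, the smallest common completion time T is set by the most heavily loaded link, and giving every flow the rate size / T lets all flows finish together at time T without exceeding any link capacity.

```python
from collections import defaultdict


def min_max_completion_rates(flows, capacity):
    """Illustrative centralized computation (hypothetical helper).

    flows:    dict mapping flow id -> (size, list of link names on its path)
    capacity: dict mapping link name -> link capacity (size units per time unit)
    Returns (T, rates), where T is the minimal common completion time and
    rates maps each flow id to its assigned rate size / T.
    """
    # Total amount of data that must cross each link during the phase.
    load = defaultdict(float)
    for size, path in flows.values():
        for link in path:
            load[link] += size

    if not load:
        return 0.0, {}

    # The bottleneck link determines the shortest feasible phase duration.
    t = max(load[link] / capacity[link] for link in load)

    # Each flow gets exactly the rate that finishes it at time t.
    rates = {flow_id: size / t for flow_id, (size, _) in flows.items()}
    return t, rates


# Hypothetical example: three flows sharing uplinks and downlinks of capacity 1.
example_flows = {
    "A": (4.0, ["up1", "down1"]),
    "B": (2.0, ["up1", "down2"]),
    "C": (2.0, ["up2", "down2"]),
}
example_capacity = {"up1": 1.0, "up2": 1.0, "down1": 1.0, "down2": 1.0}
t, rates = min_max_completion_rates(example_flows, example_capacity)
```

In this example link `up1` carries 6 units of data, so the phase cannot finish before T = 6, and the rates of flows A and B add up to exactly the capacity of `up1`. Note that this objective deliberately throttles flow C below what its own path would allow, since finishing C earlier would not shorten the phase.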