Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers

  • Authors:
  • K. Kandalla; U. Yang; J. Keasler; T. Kolev; A. Moody; H. Subramoni; K. Tomko; J. Vienne; Bronis R. de Supinski; Dhabaleswar K. Panda

  • Venue:
  • IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
  • Year:
  • 2012

Abstract

Scientists across a wide range of domains increasingly rely on computer simulation for their investigations. Such simulations often spend a majority of their run-times solving large systems of linear equations, which requires vast amounts of computational power and memory. It is hence critical to design solvers in a highly efficient and scalable manner. Hypre is a high-performance, scalable software library that offers several optimized linear solver routines and preconditioners. In this paper, we study the characteristics of Hypre's Preconditioned Conjugate Gradient (PCG) solver. The PCG routine is known to spend a majority of its communication time in the MPI_Allreduce operation, which computes a global summation during the inner product operation. MPI_Allreduce is a blocking operation whose latency is often a limiting factor to the overall efficiency of the PCG solver, and correspondingly to the performance of simulations that rely on this solver. Hence, hiding the latency of the MPI_Allreduce operation is critical to scaling the PCG solver and improving the performance of many simulations. The upcoming revision of MPI, MPI-3, will provide support for non-blocking collective communication to enable latency hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface. Such an interface can be leveraged to design non-blocking collective operations. In this paper, we design fully functional, scalable algorithms for the MPI_Iallreduce operation based on this network offload technology. To the best of our knowledge, these are the first network offload-based algorithms presented for the MPI_Iallreduce operation. Our designs scale beyond 512 processes and achieve near-perfect communication/computation overlap. We also re-design the PCG solver routine to leverage our proposed MPI_Iallreduce operation to hide the latency of the global reduction operations. We observe up to 21% improvement in the run-time of the PCG routine compared to the default PCG implementation in Hypre, and we note that about 16% of the overall benefit comes from overlapping the Allreduce operations.
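
The latency-hiding pattern the abstract describes can be illustrated with standard MPI-3 calls: start the global summation for an inner product with MPI_Iallreduce, perform independent local work while the reduction is in flight, then wait for the result. The sketch below is a minimal, self-contained illustration of that pattern; the variable names and the placeholder "independent work" step are assumptions for illustration, not the authors' actual Hypre PCG modification or their offload-based implementation.

```c
/* Minimal sketch: overlapping a global inner-product reduction
 * with independent computation using MPI-3's MPI_Iallreduce.
 * Compile with an MPI-3 capable mpicc; run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Local contribution to the inner product (placeholder value). */
    double local_dot = 1.0, global_dot = 0.0;
    MPI_Request req;

    /* Start the non-blocking global summation. */
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... independent computation that does not depend on global_dot
       (e.g., a local sparse matrix-vector product) would run here,
       overlapping with the in-flight reduction ... */

    /* Complete the reduction before using its result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("global inner product = %f\n", global_dot);
    MPI_Finalize();
    return 0;
}
```

With the hardware offload described in the paper, the progression of such a non-blocking reduction is driven by the ConnectX-2 network interface rather than the host CPU, which is what makes near-perfect communication/computation overlap achievable.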