Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
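To make the overlap strategy concrete, the following is a minimal sketch of how a halo exchange can be overlapped with interior computation using CUDA streams and non-blocking MPI. It is not QUDA's actual API; the kernel names (pack_faces, dslash_interior, dslash_boundary), buffer names, and launch parameters are hypothetical placeholders standing in for the real Wilson-Dslash implementation.

    // Hypothetical sketch of communication/computation overlap for one
    // boundary direction; NOT the QUDA API. Kernels are placeholders.
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void pack_faces(float* face_send, const float* in, int face_vol) { /* gather boundary sites */ }
    __global__ void dslash_interior(float* out, const float* in, int vol) { /* interior-only operator */ }
    __global__ void dslash_boundary(float* out, const float* in, const float* face_recv, int face_vol) { /* finish boundary sites */ }

    void dslash_overlapped(float* d_out, const float* d_in,
                           float* d_face_send, float* d_face_recv,
                           float* h_face_send, float* h_face_recv,
                           int vol, int face_vol, size_t face_bytes,
                           int fwd_rank, int bwd_rank, MPI_Comm comm)
    {
        cudaStream_t comm_stream, comp_stream;
        cudaStreamCreate(&comm_stream);
        cudaStreamCreate(&comp_stream);

        // 1. Pack the boundary faces on the GPU and start copying them to the host.
        pack_faces<<<(face_vol + 127) / 128, 128, 0, comm_stream>>>(d_face_send, d_in, face_vol);
        cudaMemcpyAsync(h_face_send, d_face_send, face_bytes,
                        cudaMemcpyDeviceToHost, comm_stream);

        // 2. While the faces are in flight, apply the operator to interior sites.
        dslash_interior<<<(vol + 127) / 128, 128, 0, comp_stream>>>(d_out, d_in, vol);

        // 3. Exchange faces with the neighbouring ranks using non-blocking MPI.
        cudaStreamSynchronize(comm_stream);  // host send buffer is now valid
        MPI_Request reqs[2];
        MPI_Irecv(h_face_recv, face_bytes, MPI_BYTE, fwd_rank, 0, comm, &reqs[0]);
        MPI_Isend(h_face_send, face_bytes, MPI_BYTE, bwd_rank, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        // 4. Copy received faces back to the device and complete the boundary sites.
        cudaMemcpyAsync(d_face_recv, h_face_recv, face_bytes,
                        cudaMemcpyHostToDevice, comm_stream);
        cudaStreamSynchronize(comm_stream);
        cudaStreamSynchronize(comp_stream);  // interior update must be finished first
        dslash_boundary<<<(face_vol + 127) / 128, 128>>>(d_out, d_in, d_face_recv, face_vol);
        cudaDeviceSynchronize();

        cudaStreamDestroy(comm_stream);
        cudaStreamDestroy(comp_stream);
    }

In this pattern the interior kernel hides the cost of the face copy and the MPI exchange; only the (much smaller) boundary update is exposed, which is the essential idea behind the overlapping strategies reported in the text.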