Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
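To make the overlap strategy concrete, the following is a minimal sketch of how a halo exchange can be overlapped with interior computation using CUDA streams and non-blocking MPI. It is not QUDA's actual API; the kernel names (pack_faces, dslash_interior, dslash_boundary), buffer names, and launch parameters are hypothetical placeholders standing in for the real Wilson-Dslash implementation.

    // Hypothetical sketch of communication/computation overlap for one
    // boundary direction; NOT the QUDA API. Kernels are placeholders.
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void pack_faces(float* face_send, const float* in, int face_vol) { /* gather boundary sites */ }
    __global__ void dslash_interior(float* out, const float* in, int vol) { /* interior-only operator */ }
    __global__ void dslash_boundary(float* out, const float* in, const float* face_recv, int face_vol) { /* finish boundary sites */ }

    void dslash_overlapped(float* d_out, const float* d_in,
                           float* d_face_send, float* d_face_recv,
                           float* h_face_send, float* h_face_recv,
                           int vol, int face_vol, size_t face_bytes,
                           int fwd_rank, int bwd_rank, MPI_Comm comm)
    {
        cudaStream_t comm_stream, comp_stream;
        cudaStreamCreate(&comm_stream);
        cudaStreamCreate(&comp_stream);

        // 1. Pack the boundary faces on the GPU and start copying them to the host.
        pack_faces<<<(face_vol + 127) / 128, 128, 0, comm_stream>>>(d_face_send, d_in, face_vol);
        cudaMemcpyAsync(h_face_send, d_face_send, face_bytes,
                        cudaMemcpyDeviceToHost, comm_stream);

        // 2. While the faces are in flight, apply the operator to interior sites.
        dslash_interior<<<(vol + 127) / 128, 128, 0, comp_stream>>>(d_out, d_in, vol);

        // 3. Exchange faces with the neighbouring ranks using non-blocking MPI.
        cudaStreamSynchronize(comm_stream);  // host send buffer is now valid
        MPI_Request reqs[2];
        MPI_Irecv(h_face_recv, face_bytes, MPI_BYTE, fwd_rank, 0, comm, &reqs[0]);
        MPI_Isend(h_face_send, face_bytes, MPI_BYTE, bwd_rank, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        // 4. Copy received faces back to the device and complete the boundary sites.
        cudaMemcpyAsync(d_face_recv, h_face_recv, face_bytes,
                        cudaMemcpyHostToDevice, comm_stream);
        cudaStreamSynchronize(comm_stream);
        cudaStreamSynchronize(comp_stream);  // interior update must be finished first
        dslash_boundary<<<(face_vol + 127) / 128, 128>>>(d_out, d_in, d_face_recv, face_vol);
        cudaDeviceSynchronize();

        cudaStreamDestroy(comm_stream);
        cudaStreamDestroy(comp_stream);
    }

In this pattern the interior kernel hides the cost of the face copy and the MPI exchange; only the (much smaller) boundary update is exposed, which is the essential idea behind the overlapping strategies reported in the text.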