Fine-grained parallelization of lattice QCD kernel routine on GPUs

Authors:
Khaled Z. Ibrahim;François Bodin;Olivier Pène
Affiliations:
IRISA/INRIA, Campus de Beaulieu, Rennes 35042, France;IRISA/INRIA, Campus de Beaulieu, Rennes 35042, France;LPT/CNRS, Université Paris-Sud, Orsay 91405, France
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by a single kernel routine that computes the action of a Dirac operator. This paper reports our experience parallelizing this kernel routine on Graphics Processing Units (GPUs), exploring several parallelization granularities. We show that fine-grained parallelism can outperform coarse-grained parallelism, provided that control-flow and communication overheads are minimized. We propose two techniques for transforming control-flow-based code into control-flow-free code. We also show how to reduce communication overhead by optimizing for commonly used sequences of calls to this routine. In our implementation on an NVIDIA 8800 GTX, we achieved an 8.3x speedup over an SSE2-optimized version running on a 2.8 GHz Intel Xeon CPU.
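
The abstract does not describe the two control-flow transformations, so the following is only a generic illustration of the idea, not the paper's method. It is a minimal C sketch, assuming a periodic one-dimensional lattice of `extent` sites, in which the forward-neighbour index is computed once with a branch and once with a mask instead of a branch. The names `next_site_branch`, `next_site_select`, and `count_mismatches` are hypothetical and chosen for this example.

```c
#include <assert.h>

/* Hypothetical example: forward neighbour on a periodic 1-D lattice.
 * Precondition for both functions: 0 <= x < extent. */

/* Control-flow-based version: an explicit branch handles the wrap-around. */
int next_site_branch(int x, int extent)
{
    if (x + 1 == extent)
        return 0;
    return x + 1;
}

/* Control-flow-free version: the comparison yields 0 or 1, and negating it
 * gives an all-zeros or all-ones mask (two's complement). ANDing with the
 * mask keeps x + 1 inside the lattice and forces 0 at the boundary. */
int next_site_select(int x, int extent)
{
    int step = x + 1;
    int keep = -(step != extent);  /* -1 inside the lattice, 0 on wrap */
    return step & keep;
}

/* Counts the sites where the two versions disagree (0 if they match). */
int count_mismatches(int extent)
{
    int mismatches = 0;
    for (int x = 0; x < extent; ++x) {
        if (next_site_branch(x, extent) != next_site_select(x, extent))
            ++mismatches;
    }
    assert(mismatches >= 0);
    return mismatches;
}
```

On hardware where threads that execute together must follow the same instruction stream, the masked form lets boundary and interior sites run the same instructions. Whether it is actually faster depends on the compiler and the device, so it should be measured rather than assumed.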