Fine-grained parallelization of lattice QCD kernel routine on GPUs

  • Authors: Khaled Z. Ibrahim; François Bodin; Olivier Pène

  • Affiliations: IRISA/INRIA, Campus de Beaulieu, Rennes 35042, France; IRISA/INRIA, Campus de Beaulieu, Rennes 35042, France; LPT/CNRS, Université Paris-Sud, Orsay 91405, France

  • Venue: Journal of Parallel and Distributed Computing

  • Year: 2008

Abstract

Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by a single kernel routine that computes the action of a Dirac operator. This paper reports our experience in parallelizing that routine on Graphics Processing Units (GPUs), exploring several parallelization granularities. We show that fine-grained parallelism can outperform coarse-grained parallelism, provided that control-flow and communication overheads are minimized. We propose two techniques for transforming control-flow-based code into control-flow-free code. We also show how to reduce communication overhead by optimizing for commonly used sequences of calls to this routine. Our implementation on an NVIDIA 8800 GTX achieves an 8.3x speedup over an SSE2-optimized version running on a 2.8 GHz Intel Xeon CPU.
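
To make the idea of control-flow-free code concrete, the following is a minimal sketch in C of one generic way to remove a branch from a per-site loop. It is not taken from the paper and is not claimed to be either of its two techniques. The one-dimensional periodic lattice, the forward-difference computation, and the function names `forward_diff_branchy` and `forward_diff_branch_free` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: forward difference on a 1D periodic lattice.
 * Branching version: an if statement handles the wrap-around at the
 * last site, so sites may follow different control paths. */
void forward_diff_branchy(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        size_t next = i + 1;
        if (next == n) {
            next = 0;               /* wrap to the first site */
        }
        out[i] = in[next] - in[i];
    }
}

/* Branch-free version: the wrap-around is expressed as arithmetic.
 * The comparison (i + 1 != n) evaluates to 1 for interior sites and
 * to 0 at the last site, so the product is i + 1 or 0 respectively. */
void forward_diff_branch_free(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        size_t next = (i + 1) * (size_t)(i + 1 != n);
        out[i] = in[next] - in[i];
    }
}
```

Both functions produce the same output. In the second, every site executes the same instruction sequence, which is the property that matters when many lattice sites are processed in lockstep by fine-grained threads. Whether it is faster in practice depends on the compiler and the hardware.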