Efficient SIMDization and data management of the Lattice QCD computation on the Cell Broadband Engine

Authors:
Khaled Z. Ibrahim; François Bodin
Affiliations:
Khaled Z. Ibrahim (corresponding author): IRISA/INRIA, Campus de Beaulieu, Rennes 35042, France. Tel.: +33 2 9984 7110; Fax: +33 2 9984 7171; E-mail: kibrahim@irisa.fr
François Bodin: IRISA/INRIA, Campus de Beaulieu, Rennes, France
Venue:
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Year:
2009

Abstract

Lattice Quantum Chromodynamics (QCD) models subatomic interactions on a four-dimensional discretized space-time continuum. The Lattice QCD computation is one of the grand challenges in physics, especially when modeling a lattice with small spacing. In this work, we study the implementation on the Cell Broadband Engine of the main kernel routine of Lattice QCD, which dominates the execution time. We tackle two problems: efficient SIMD execution and the limited bandwidth for data transfers to and from off-chip memory. For efficient SIMD execution, we present a runtime data fusion technique that groups data that are processed similarly. We also introduce the analysis needed to reduce the pressure on the scarce memory bandwidth that limits the performance of this computation. We study two implementations of the main kernel routine that exhibit different memory access patterns and thus allow different sets of optimizations, and we show the attributes that make one implementation more favorable in terms of performance. For a lattice size significantly larger than the local store, our implementation achieves 31.2 GFlops for single-precision computations and 16.6 GFlops for double-precision computations on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.