The BlueGene/L supercomputer and quantum ChromoDynamics
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Fine-grained parallelization of lattice QCD kernel routine on GPUs
Journal of Parallel and Distributed Computing
FFTC: fastest Fourier transform for the IBM cell broadband engine
HiPC'07 Proceedings of the 14th international conference on High performance computing
A fast implementation of the octagon abstract domain on graphics hardware
SAS'07 Proceedings of the 14th international conference on Static Analysis
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Lattice Quantum Chromodynamic (QCD) models subatomic interactions based on a four-dimensional discretized space-time continuum. The Lattice QCD computation is one of the grand challenges in physics especially when modeling a lattice with small spacing. In this work, we study the implementation of the main kernel routine of Lattice QCD that dominates the execution time on the Cell Broadband Engine. We tackle the problem of efficient SIMD execution and the problem of limited bandwidth for data transfers with the off-chip memory. For efficient SIMD execution, we present runtime data fusion technique that groups data processed similarly at runtime. We also introduce analysis needed to reduce the pressure on the scarce memory bandwidth that limits the performance of this computation. We studied two implementations for the main kernel routine that exhibit different patterns of accessing the memory and thus allowing different sets of optimizations. We show the attributes that make one implementation more favorable in terms of performance. For lattice size that is significantly larger than the local store, our implementation achieves 31.2 GFlops for single precision computations and 16.6 GFlops for double precision computations on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.