Implementing Wilson-Dirac Operator on the Cell Broadband Engine

Authors:
Khaled Z. Ibrahim; Francois Bodin
Affiliations:
IRISA/INRIA, Rennes, France; IRISA/INRIA, Rennes, France
Venue:
Proceedings of the 22nd annual international conference on Supercomputing
Year:
2008

Abstract

Computing the action of the Wilson-Dirac operator accounts for most of the CPU time in simulations of Lattice Quantum Chromodynamics (Lattice QCD), a grand challenge problem. The routine is difficult to implement efficiently in most computational environments because the same data is accessed in multiple patterns, which makes it hard to align the data efficiently at compile time. In addition, its low ratio of computation to memory access makes it bound by memory bandwidth and memory latency. In this work, we present an implementation of this routine on the Cell Broadband Engine. We propose runtime data fusion, an approach that re-aligns at runtime the data that cannot be aligned optimally at compile time, thereby improving the performance of SIMDized execution. We also present a DMA optimization technique that reduces the impact of bandwidth limits on performance. Our implementation of this routine achieves 31.2 GFlops for single-precision computations and 8.75 GFlops for double-precision computations.
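
To make the remark about the low computation-to-memory-access ratio concrete, here is a rough back-of-the-envelope sketch in Python. The per-site counts are assumptions chosen for illustration and are not taken from the paper: 1320 floating-point operations per site is a commonly quoted figure for the Wilson hopping term, a spinor is taken as 4 x 3 complex numbers, and a gauge link as a 3 x 3 complex matrix. The function names and the `reuse` flag are hypothetical.

```python
# Back-of-the-envelope arithmetic intensity for one lattice site.
# Every constant here is an illustrative assumption, not a figure from the paper.

FLOPS_PER_SITE = 1320          # commonly quoted count for the Wilson hopping term
REALS_PER_SPINOR = 4 * 3 * 2   # 4 spin x 3 color components, each complex
REALS_PER_LINK = 3 * 3 * 2     # 3 x 3 complex gauge-link matrix
DIRECTIONS = 8                 # forward and backward in each of 4 dimensions


def bytes_per_site(bytes_per_real, reuse):
    """Memory traffic per site under two idealized extremes."""
    if reuse:
        # Ideal reuse: each spinor is read once, each link is read once
        # (4 links per site on average), and one spinor is written.
        reals = REALS_PER_SPINOR + 4 * REALS_PER_LINK + REALS_PER_SPINOR
    else:
        # No reuse: all 8 neighbor spinors and all 8 links are fetched
        # for every site, and one spinor is written.
        reals = DIRECTIONS * (REALS_PER_SPINOR + REALS_PER_LINK) + REALS_PER_SPINOR
    return reals * bytes_per_real


def arithmetic_intensity(bytes_per_real, reuse):
    """Floating-point operations per byte of memory traffic."""
    return FLOPS_PER_SITE / bytes_per_site(bytes_per_real, reuse)


def bandwidth_bound_gflops(bandwidth_gb_per_s, bytes_per_real, reuse):
    """Upper bound on GFlops when memory bandwidth is the only limit."""
    return bandwidth_gb_per_s * arithmetic_intensity(bytes_per_real, reuse)


if __name__ == "__main__":
    for name, width in (("single", 4), ("double", 8)):
        for reuse in (False, True):
            ai = arithmetic_intensity(width, reuse)
            print(f"{name} precision, reuse={reuse}: {ai:.2f} flops/byte")
```

Under these assumptions, single precision moves 1440 bytes per site with no reuse and 480 bytes with ideal reuse, so the intensity lies somewhere between roughly 0.92 and 2.75 flops per byte. Either way only a few operations are performed per byte moved, which is why on-chip data reuse and the scheduling of transfers matter so much for this kernel.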