The bottom-up implementation of one MILC lattice QCD application on the Cell blade

Authors:
Guochun Shi;Volodymyr Kindratenko;Steven Gottlieb
Affiliations:
National Center for Supercomputing Applications, University of Illinois, Urbana, IL;National Center for Supercomputing Applications, University of Illinois, Urbana, IL;Department of Physics, Indiana University, Bloomington, IN
Venue:
International Journal of Parallel Programming
Year:
2009

Abstract

We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC's framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell's synergistic processor elements. Speedups of 3.4× for the 8 × 8 × 16 × 16 lattice and 5.7× for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell's synergistic processor elements, whereas performance of the application's parallel execution framework is limited by the bandwidth between main memory and the Cell's power processor element.