Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Authors:
Michael Kistler, John Gunnels, Daniel Brokenshire, Brad Benton
Affiliations:
IBM Corporation. E-mails: {mkistler, gunnels, brokensh, brad.benton}@us.ibm.com
Venue:
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Year:
2009

Abstract

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double-precision floating-point capability of 108.8 GFLOPS per processor. We explain how we modified the standard open-source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
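
To put the reported numbers in context, the following minimal Python sketch estimates what fraction of theoretical peak the quoted Linpack results represent. It is an illustration only, not part of the paper. It assumes the 108.8 GFLOPS peak applies to one PowerXCell 8i processor, which is consistent with the 170.7 GFLOPS achieved on a two-processor blade, and that cluster peak scales linearly with the number of blades. The function name `fraction_of_peak` and the constant names are hypothetical.

```python
# Figures quoted in the abstract.
PEAK_PER_PROCESSOR_GFLOPS = 108.8  # assumption: peak is per PowerXCell 8i processor
PROCESSORS_PER_BLADE = 2           # a BladeCenter QS22 holds two processors
SINGLE_BLADE_GFLOPS = 170.7        # reported single-blade Linpack result
CLUSTER_TFLOPS = 11.1              # reported cluster Linpack result
CLUSTER_BLADES = 84                # blades in the reported cluster run


def fraction_of_peak(achieved_gflops, blades):
    """Return achieved performance divided by theoretical peak for `blades` blades."""
    peak_gflops = PEAK_PER_PROCESSOR_GFLOPS * PROCESSORS_PER_BLADE * blades
    return achieved_gflops / peak_gflops


single = fraction_of_peak(SINGLE_BLADE_GFLOPS, 1)
cluster = fraction_of_peak(CLUSTER_TFLOPS * 1000.0, CLUSTER_BLADES)

print(f"single blade: {single:.1%} of peak")
print(f"84-blade cluster: {cluster:.1%} of peak")
```

Under these assumptions the single blade reaches roughly 78% of peak and the 84-blade cluster roughly 61%. The gap between the two would reflect inter-node communication and other cluster-level overheads, which this simple ratio does not model.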