A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Scalability issues affecting the design of a dense linear algebra library
Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
ACM SIGGRAPH 2003 Papers
High-performance linear algebra algorithms using new generalized data structures for matrices
IBM Journal of Research and Development
Introduction to the cell broadband engine architecture
IBM Journal of Research and Development
Cell broadband engine architecture and its first implementation: a performance view
IBM Journal of Research and Development
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Multi-threading and one-sided communication in parallel LU factorization
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Petascale computing with accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Adaptation of double-precision matrix multiplication to the cell broadband engine architecture
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
An (almost) direct deployment of the Fast Multipole Method on the Cell processor
The Journal of Supercomputing
Hi-index | 0.00 |
In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i 1 processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ 2 architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.