Adaptation of double-precision matrix multiplication to the cell broadband engine architecture
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
The main contribution of this paper is a model-driven approach to adapting double-precision matrix multiplication to blade systems based on two generations of the Cell processor. The hierarchical algorithm used for the adaptation consists of four levels. The first level distributes the computation among all 16 SPE cores of the IBM BladeCenter QS21 or QS22. The second level corresponds to a macro-kernel and is responsible for data management in main memory, as well as for communication between main memory and the local stores of the SPE cores; each macro-kernel operation is executed within the local store of an SPE core. The third level corresponds to a kernel of the algorithm; each kernel operation is executed on a single SPE within its local store as a sequence of micro-kernel operations. The fourth level is a micro-kernel implemented within the register file of an SPE core.

The proposed approach is based on two performance models. The first model optimizes communication across all 16 SPE cores of the IBM BladeCenter, including traffic between main memory and the local stores of the SPEs. It is constructed as a function of the matrix block size and allows "the best" size of the macro-kernel operation to be selected. The second model optimizes computation within a single SPE core, taking into account constraints on traffic between the local store and the register file. It accounts for factors such as the size of the local store, the number of registers, the properties of double-precision operations, and the balance between pipelines, and it allows "the best" sizes of the kernel and micro-kernel operations to be selected. The model-driven adaptation is followed by a series of systematic optimization steps, including loop unrolling, double buffering at the register and memory levels, and use of the NUMA library.
The proposed adaptation and optimization steps are implemented entirely in C, without manual code optimization. For the IBM QS21 system, which uses two first-generation Cell processors, the implementation achieves 27.24 Gflop/s, or 93.1% of peak performance; this result is obtained for matrices of size 4096 by 4096. For the IBM QS22 system, based on PowerXCell 8i processors, whose double-precision performance is far higher, 184.4 Gflop/s is achieved, or 90.0% of peak; this result is reported for matrices of size 15,872 by 15,872. The overall performance could be improved slightly by substituting the macro-kernel developed in this work with the highly optimized Cell BLAS dgemm_64x64 kernel.