This paper presents an approach to adapting double-precision matrix multiplication to the architecture of Cell processors. The algorithm used on a single SPE is based on the operation C = C + A*B performed for matrices of size 64×64; these matrices are further divided into smaller submatrices, each of which corresponds to one micro-kernel operation. Our approach relies on a performance model constructed as a function of the submatrix size. The model accounts for factors such as the size of the local storage, the number of registers, the properties of double-precision operations, and the balance between pipelines. This allows us to take into consideration the properties of both the first generation of Cell processors and its successor, the PowerXCell 8i. The adaptation is followed by an optimization phase that includes loop transformations, kernel implementation with SIMD instructions, and other transformations necessary to balance the even and odd pipelines. Finally, we present hand tuning performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations allow us to achieve about 96% of the peak performance.
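The decomposition described above, a block update C = C + A*B split into micro-kernel operations on smaller submatrices, can be illustrated with a minimal sketch. This is not the paper's SPE implementation: it is plain Python rather than SIMD code, and the function names and the tile sizes `mr`, `nr`, and `kc` are hypothetical parameters chosen for illustration. In the paper, the submatrix size is selected by the performance model.

```python
def micro_kernel(A, B, C, i0, j0, k0, mr, nr, kc):
    """Update one mr x nr tile of C with an (mr x kc) by (kc x nr) product.

    The local accumulator stands in for the registers that would hold the
    C tile on a real SPE (an illustrative assumption, not the paper's kernel).
    """
    acc = [[C[i0 + i][j0 + j] for j in range(nr)] for i in range(mr)]
    for k in range(k0, k0 + kc):
        row_b = B[k]
        for i in range(mr):
            a = A[i0 + i][k]
            for j in range(nr):
                acc[i][j] += a * row_b[j0 + j]
    # Write the accumulated tile back to C.
    for i in range(mr):
        for j in range(nr):
            C[i0 + i][j0 + j] = acc[i][j]


def block_multiply_add(A, B, C, n=64, mr=4, nr=4, kc=64):
    """Compute C = C + A*B in place for n x n matrices stored as lists of lists.

    mr, nr, and kc are hypothetical tile sizes; they must divide n.
    """
    if n % mr or n % nr or n % kc:
        raise ValueError("tile sizes must divide n")
    for i0 in range(0, n, mr):
        for j0 in range(0, n, nr):
            for k0 in range(0, n, kc):
                micro_kernel(A, B, C, i0, j0, k0, mr, nr, kc)
    return C
```

Calling `block_multiply_add(A, B, C)` with three 64×64 matrices performs the block update in place. Varying `mr` and `nr` changes how many accumulators each micro-kernel call needs, which is the kind of trade-off against the number of registers that a performance model of the submatrix size has to capture.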