Adaptation of double-precision matrix multiplication to the Cell Broadband Engine architecture

Authors:
Krzysztof Rojek;Łukasz Szustak
Affiliations:
Czestochowa University of Technology, Institute of Computer and Information Sciences, Czestochowa, Poland;Czestochowa University of Technology, Institute of Computer and Information Sciences, Czestochowa, Poland
Venue:
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Year:
2009



Abstract

This paper presents an approach to adapting double-precision matrix multiplication to the architecture of Cell processors. The algorithm used on a single SPE is based on the operation C = C + A*B performed for matrices of size 64×64; these matrices are further divided into smaller submatrices that correspond to micro-kernel operations. Our approach is based on a performance model constructed as a function of the submatrix size. The model accounts for factors such as the size of the local store, the number of registers, the properties of double-precision operations, and the balance between pipelines. This approach allows us to take into account the properties of both the first generation of Cell processors and its successor, the PowerXCell 8i. The adaptation is followed by an optimization phase that includes loop transformations, kernel implementation with SIMD instructions, and other transformations necessary to balance the even and odd pipelines. Finally, we present hand-tuning performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations allow us to achieve about 96% of the peak performance.
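
To make the decomposition concrete, the following is a minimal sketch in portable scalar C of a 64×64 update C = C + A*B split into submatrix micro-kernel calls. It is not the authors' SPE code: the micro-kernel shape (`MR` × `NR` = 4 × 4), the row-major layout, and the names `micro_kernel` and `tile_dgemm` are assumptions chosen for illustration, and the SIMD instructions, pipeline balancing, and hand-tuning described in the abstract are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define N  64  /* matrix size handled on one SPE, as stated in the abstract */
#define MR 4   /* micro-kernel rows: hypothetical value, not from the paper */
#define NR 4   /* micro-kernel columns: hypothetical value, not from the paper */

/* Micro-kernel: updates one MR x NR submatrix of C with the product of an
 * MR x N panel of A and an N x NR panel of B. All matrices are row-major
 * with leading dimension N. The accumulators live in a small local array,
 * standing in for the registers a tuned kernel would use. */
static void micro_kernel(const double *a, const double *b, double *c)
{
    double acc[MR][NR];

    /* Load the current C submatrix into the accumulators. */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            acc[i][j] = c[i * N + j];

    /* Accumulate one rank-1 update per step of k. */
    for (size_t k = 0; k < N; k++)
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += a[i * N + k] * b[k * N + j];

    /* Write the updated submatrix back to C. */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            c[i * N + j] = acc[i][j];
}

/* C = C + A*B for N x N row-major matrices, one micro-kernel call per
 * MR x NR submatrix of C. N is a multiple of both MR and NR here, so no
 * edge handling is needed. */
void tile_dgemm(const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < N; i += MR)
        for (size_t j = 0; j < N; j += NR)
            micro_kernel(a + i * N, b + j, c + i * N + j);
}
```

In this sketch the submatrix size is fixed by the `MR` and `NR` macros. In the approach described above, that size is the quantity the performance model is evaluated over, so a model-driven version would choose these values from constraints such as register count and local store capacity rather than hard-coding them.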