Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor

Authors:
Wesley Alvaro;Jakub Kurzak;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville, TN 37996, USA; University of Tennessee, Knoxville, TN 37996, USA; University of Tennessee, Knoxville, TN 37996, USA, and Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA, and University of Manchester, Manchester M13 9PL, UK
Venue:
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Year:
2008

Abstract

Matrix multiplication is one of the most common numerical operations, especially in dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance. To fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented that implement the C = C − A × B^T operation and the C = C − A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80 percent of the peak, using as little as 5.9 KB of storage for code and auxiliary data structures.
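To make the two operations and the efficiency figure concrete, the following is a minimal scalar reference sketch, not the authors' SIMD kernel. It spells out C = C − A × B^T and C = C − A × B for square tiles and reproduces the percent-of-peak arithmetic. The function names are hypothetical, and the peak value rests on assumptions not stated in the abstract: a 3.2 GHz SPE clock, four single-precision lanes per vector, and one fused multiply-add (counted as two floating-point operations) issued per cycle.

```python
N = 64  # tile size stated in the abstract


def gemm_sub_nt(C, A, B):
    """Update C in place: C = C - A * B^T (lists of rows, square tiles)."""
    n = len(C)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # Row i of A times row j of B, i.e. column j of B^T.
                acc += A[i][k] * B[j][k]
            C[i][j] -= acc
    return C


def gemm_sub_nn(C, A, B):
    """Update C in place: C = C - A * B (lists of rows, square tiles)."""
    n = len(C)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # Row i of A times column j of B.
                acc += A[i][k] * B[k][j]
            C[i][j] -= acc
    return C


def tile_flops(n=N):
    """Floating-point operations per tile update: n^3 multiplies and n^3 adds."""
    return 2 * n ** 3


def peak_gflops(clock_ghz=3.2, lanes=4, flops_per_fma=2):
    """Assumed SPE single-precision peak; all three defaults are assumptions."""
    return clock_ghz * lanes * flops_per_fma


def percent_of_peak(measured_gflops, peak):
    """Measured rate expressed as a percentage of the peak rate."""
    return 100.0 * measured_gflops / peak


if __name__ == "__main__":
    print(tile_flops())
    print(peak_gflops())
    print(round(percent_of_peak(25.55, peak_gflops()), 2))
```

Under these assumptions the peak is 25.6 Gflop/s, so the reported 25.55 Gflop/s corresponds to about 99.80 percent of peak, matching the abstract. One 64 × 64 tile update costs 2 × 64³ = 524,288 floating-point operations, which at the reported rate takes roughly 20.5 microseconds.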