Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Authors:
Jakub Kurzak;Wesley Alvaro;Jack Dongarra
Affiliations:
Department of Electrical Engineering and Computer Science, University of Tennessee, United States;Department of Electrical Engineering and Computer Science, University of Tennessee, United States;Department of Electrical Engineering and Computer Science, University of Tennessee, United States and Computer Science and Mathematics Division, Oak Ridge National Laboratory, United States and Sc ...
Venue:
Parallel Computing
Year:
2009

Abstract

Matrix multiplication is one of the most common numerical operations, especially in dense linear algebra, where it forms the core of many important algorithms, including solvers for linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators such as Graphics Processing Units (GPUs). To fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented that implement the C = C - A x B^T operation and the C = C - A x B operation for matrices of 64 x 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.
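
To make the two operations named in the abstract concrete, the following is a minimal scalar reference sketch in C. It is not the paper's kernel, which is hand-tuned for the SIMD unit of the Synergistic Processing Element; it only states what each tile update computes. The function names, the row-major layout, and the `TILE` macro are illustrative assumptions. The helper `fraction_of_peak` reproduces the arithmetic behind the reported figure under the assumption that the per-element peak is 25.6 Gflop/s (a 3.2 GHz clock, 4 single-precision lanes, and a fused multiply-add counted as 2 flops), a value the abstract does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64 /* tile edge from the abstract: 64 x 64 elements */

/* Reference for C = C - A * B^T on row-major TILE x TILE float tiles.
 * Hypothetical name and layout; a plain triple loop, not a SIMD kernel. */
void tile_gemm_nt_sub(float *c, const float *a, const float *b)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < TILE; i++) {
        for (size_t j = 0; j < TILE; j++) {
            float acc = 0.0f;
            /* Row i of A dotted with row j of B (i.e. column j of B^T). */
            for (size_t k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[j * TILE + k];
            c[i * TILE + j] -= acc;
        }
    }
}

/* Reference for C = C - A * B on row-major TILE x TILE float tiles. */
void tile_gemm_nn_sub(float *c, const float *a, const float *b)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < TILE; i++) {
        for (size_t j = 0; j < TILE; j++) {
            float acc = 0.0f;
            /* Row i of A dotted with column j of B. */
            for (size_t k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[k * TILE + j];
            c[i * TILE + j] -= acc;
        }
    }
}

/* Floating-point operations in one tile update:
 * one multiply and one add per (i, j, k) triple. */
double tile_flops(void)
{
    return 2.0 * TILE * TILE * TILE;
}

/* Ratio of achieved to peak performance, both in Gflop/s. */
double fraction_of_peak(double achieved_gflops, double peak_gflops)
{
    return achieved_gflops / peak_gflops;
}
```

Under the assumed 25.6 Gflop/s peak, `fraction_of_peak(25.55, 25.6)` gives about 0.9980, which is consistent with the 99.80% quoted in the abstract. A reference routine of this kind is mainly useful for checking the output of an optimized kernel on small test tiles.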