Matrix multiplication is one of the most common numerical operations, especially in dense linear algebra, where it forms the core of many important algorithms, including solvers for linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators such as Graphics Processing Units (GPUs). To fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented that implement the C = C - A × B^T operation and the C = C - A × B operation for matrices of 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.
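To make the two operations concrete, the following is a minimal scalar reference sketch in C. It is not the SIMD kernel described in the paper; it only states what each kernel computes on a 64 × 64 single-precision tile. The function names `gemm_nt_sub` and `gemm_nn_sub` and the row-major storage layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 64 /* tile dimension stated in the abstract */

/* Hypothetical reference for C = C - A * B^T.
 * All matrices are N x N, single precision, row-major (an assumption). */
static void gemm_nt_sub(float *C, const float *A, const float *B)
{
    assert(C != NULL && A != NULL && B != NULL);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Row i of A times row j of B, i.e. column j of B^T. */
            for (int k = 0; k < N; k++)
                acc += A[i * N + k] * B[j * N + k];
            C[i * N + j] -= acc;
        }
    }
}

/* Hypothetical reference for C = C - A * B, same layout assumptions. */
static void gemm_nn_sub(float *C, const float *A, const float *B)
{
    assert(C != NULL && A != NULL && B != NULL);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Row i of A times column j of B. */
            for (int k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] -= acc;
        }
    }
}
```

Each call performs 2 × 64^3 = 524,288 floating-point operations (one multiply and one add per inner-loop iteration). A reference of this kind is mainly useful for checking the output of an optimized kernel, not for performance.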