We describe an implementation of the Level-3 BLAS (Basic Linear Algebra Subprograms) built on the matrix-matrix multiplication kernel (GEMM). Blocking techniques are used to express the BLAS in terms of operations on triangular blocks and calls to GEMM. A principal advantage of this approach is that most manufacturers provide at least an efficient serial version of GEMM, so our implementation can capture a significant percentage of each machine's performance. A parameter that controls the blocking allows efficient exploitation of the memory hierarchy of the various target computers. Furthermore, this blocked version of the Level-3 BLAS is naturally parallel. We present results on the ALLIANT FX/80, the CONVEX C220, the CRAY-2, and the IBM 3090/VF. For GEMM, we always use the manufacturer-supplied versions. For the operations dealing with triangular blocks, we use assembler or tuned Fortran codes (with loop unrolling), depending on the efficiency of the available libraries.
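To make the blocking idea concrete, here is a minimal sketch in plain Python of one Level-3 operation, a triangular solve L X = B with L lower triangular, written so that only the small diagonal blocks need a triangular kernel and all remaining work goes through a GEMM-style update. It is an illustration under stated assumptions, not the authors' code: the function names `gemm_update`, `trsm_unblocked`, and `blocked_trsm` and the block-size parameter `nb` are hypothetical, and the pure-Python `gemm_update` merely stands in for a manufacturer-supplied GEMM.

```python
def gemm_update(C, A, B, alpha=-1.0):
    """C += alpha * A * B on lists of rows (stand-in for a vendor GEMM)."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = alpha * A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def trsm_unblocked(L, B):
    """Solve L X = B in place (X overwrites B) for a small lower-triangular L."""
    n = len(L)
    ncols = len(B[0])
    for i in range(n):
        for k in range(i):
            for j in range(ncols):
                B[i][j] -= L[i][k] * B[k][j]
        for j in range(ncols):
            B[i][j] /= L[i][i]


def blocked_trsm(L, B, nb=2):
    """Solve L X = B for lower-triangular L using blocks of order nb.

    Returns X as a new list of rows; L and B are left unchanged.
    """
    n = len(L)
    X = [row[:] for row in B]  # work on a copy of the right-hand side
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # Triangular kernel on the diagonal block L[s:e, s:e].
        L_diag = [row[s:e] for row in L[s:e]]
        X_blk = X[s:e]  # shares row objects with X, so updates land in X
        trsm_unblocked(L_diag, X_blk)
        # GEMM update of the remaining rows: X[e:] -= L[e:, s:e] * X[s:e].
        if e < n:
            L_panel = [row[s:e] for row in L[e:]]
            gemm_update(X[e:], L_panel, X_blk, alpha=-1.0)
    return X
```

In this sketch `nb` plays the role of the blocking parameter: a larger value moves more of the arithmetic into the triangular kernel, a smaller one into the GEMM updates, and on a real machine it would be chosen to suit the memory hierarchy.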