A parallel block implementation of Level-3 BLAS for MIMD vector processors

Authors:
Michel J. Daydé; Iain S. Duff; Antoine Petitet
Affiliations:
ENSEEIHT-IRIT, Toulouse, France; CERFACS, Toulouse, France; CERFACS, Toulouse, France
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
1994

Abstract

We describe an implementation of the Level-3 BLAS (Basic Linear Algebra Subprograms) built on the matrix-matrix multiplication kernel (GEMM). Blocking techniques are used to express the BLAS in terms of operations on triangular blocks and calls to GEMM. A principal advantage of this approach is that most manufacturers provide at least an efficient serial version of GEMM, so our implementation can capture a significant percentage of the machine's performance. A parameter that controls the blocking allows efficient exploitation of the memory hierarchy of the various target computers. Furthermore, this blocked version of the Level-3 BLAS is naturally parallel. We present results on the ALLIANT FX/80, the CONVEX C220, the CRAY-2, and the IBM 3090/VF. For GEMM, we always use the manufacturer-supplied versions. For the operations dealing with triangular blocks, we use assembler or tuned Fortran codes (using loop unrolling), depending on the efficiency of the available libraries.
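The blocking idea in the abstract can be illustrated with a minimal sketch of one Level-3 operation, a triangular solve (the TRSM-style problem L X = B with L lower triangular). This is not the authors' code: the function names `gemm_update`, `trsm_block`, and `blocked_trsm` and the parameter name `nb` are hypothetical, and the pure-Python `gemm_update` merely stands in for a manufacturer-supplied GEMM. The sketch assumes dense matrices stored as lists of rows and a nonsingular L.

```python
def gemm_update(C, A, B, alpha=1.0):
    """C += alpha * A * B on lists of rows (stand-in for a vendor GEMM)."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = alpha * A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def trsm_block(T, B):
    """Solve T X = B in place for a small lower-triangular block T
    by forward substitution (the 'triangular block' kernel)."""
    for i in range(len(T)):
        for k in range(i):
            for j in range(len(B[0])):
                B[i][j] -= T[i][k] * B[k][j]
        for j in range(len(B[0])):
            B[i][j] /= T[i][i]


def blocked_trsm(L, B, nb):
    """Solve L X = B for lower-triangular L using blocks of order nb.

    Only the diagonal blocks go through the triangular kernel;
    all remaining work is expressed as GEMM updates.
    """
    n = len(L)
    X = [row[:] for row in B]          # work on a copy of B
    for s in range(0, n, nb):
        e = min(s + nb, n)
        diag = [row[s:e] for row in L[s:e]]
        panel = X[s:e]                 # shares row objects with X
        trsm_block(diag, panel)        # solve with the diagonal block
        if e < n:
            below = [row[s:e] for row in L[e:]]
            rest = X[e:]               # rows still to be solved
            gemm_update(rest, below, panel, alpha=-1.0)
    return X


L = [[2, 0, 0],
     [1, 1, 0],
     [4, 2, 4]]
B = [[2, 4],
     [3, 5],
     [16, 18]]
X = blocked_trsm(L, B, nb=2)
```

Here `nb` plays the role of the blocking parameter: a larger value moves more work into the triangular kernel, a smaller one moves more into GEMM. In this sketch the GEMM update touches each row of `rest` independently, which is one place where the parallelism mentioned in the abstract could plausibly be exploited.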