The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors

Authors:
Michel J. Daydé;Iain S. Duff
Affiliations:
ENSEEIT-IRIT, Toulouse, France;CERFACS and Rutherford Appleton Lab., Oxon, England
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
1999

Citing 11
Cited 0

The use of BLAS3 in linear algebra on a parallel processor with a hierarchical memory

SIAM Journal on Scientific and Statistical Computing
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
LAPACK's user's guide

LAPACK's user's guide
A set of high-performance level 3 BLAS structured and tuned for the IBM 3090 VF and implemented in Fortran 77

The Journal of Supercomputing
A parallel block implementation of Level-3 BLAS for MIMD vector processors

ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues

ACM Transactions on Mathematical Software (TOMS)
LAPPACK Working Note No. 28: The IBM RISC System/6000 and Linear Algebra Operations

LAPPACK Working Note No. 28: The IBM RISC System/6000 and Linear Algebra Operations

Quantified Score

Hi-index	0.00

Visualization

Abstract

We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to the matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or assembler code is used. This blocked implementation uses the same blocking ideas as in our implementation for vector machines except that the ordering of loops is designed for efficient reuse of date held in cache and not necessarily for parallelization. All the codes are specifically tuned for RISC processors. The software also includes a tuned version of GEMM. A parameter which controls the blocking allows efficient exploitation of the memory hierarchy on the various target computers. We present results on a range of RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300, HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation.