Exploiting fast matrix multiplication within the level 3 BLAS

Authors: Nicholas J. Higham
Affiliations: Univ. of Manchester, Manchester, UK
Venue: ACM Transactions on Mathematical Software (TOMS)
Year: 1990

Abstract

The Level 3 BLAS (BLAS3) are a set of specifications of FORTRAN 77 subprograms for carrying out matrix multiplications and the solution of triangular systems with multiple right-hand sides. They are intended to provide efficient and portable building blocks for linear algebra algorithms on high-performance computers. We describe algorithms for the BLAS3 operations that are asymptotically faster than the conventional ones. These algorithms are based on Strassen's method for fast matrix multiplication, which is now recognized to be a practically useful technique once matrix dimensions exceed about 100. We pay particular attention to the numerical stability of these “fast BLAS3.” Error bounds are given and their significance is explained and illustrated with the aid of numerical experiments. Our conclusion is that the fast BLAS3, although not as strongly stable as conventional implementations, are stable enough to merit careful consideration in many applications.
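
As a rough illustration of the kind of algorithm the abstract refers to, the following is a minimal sketch of Strassen's method for square matrices, written in plain Python rather than FORTRAN 77. It is not the paper's implementation. The function names (`matmul_conventional`, `strassen`), the list-of-rows representation, and the choice to fall back to the conventional product for odd dimensions are all assumptions made for this example. The default `cutoff=100` simply echoes the abstract's remark that the method becomes useful once dimensions exceed about 100; it is not a tuned value.

```python
def matmul_conventional(A, B):
    """Conventional product of two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def _add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def _sub(X, Y):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def _split(M):
    """Split an even-order square matrix into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=100):
    """Multiply square matrices A and B with Strassen's recursion.

    Assumption for this sketch: recursion stops, and the conventional
    product is used, when the order is at most `cutoff` or is odd.
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return matmul_conventional(A, B)

    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)

    # Seven half-size products instead of the conventional eight.
    M1 = strassen(_add(A11, A22), _add(B11, B22), cutoff)
    M2 = strassen(_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, _sub(B12, B22), cutoff)
    M4 = strassen(A22, _sub(B21, B11), cutoff)
    M5 = strassen(_add(A11, A12), B22, cutoff)
    M6 = strassen(_sub(A21, A11), _add(B11, B12), cutoff)
    M7 = strassen(_sub(A12, A22), _add(B21, B22), cutoff)

    # Recombine the products into the four quadrants of C = A * B.
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_add(_sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

With integer entries the two routines agree exactly, which makes small cases easy to check. In floating-point arithmetic the results generally differ by rounding errors, which is the stability question the abstract raises.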