Cache blocking for linear algebra algorithms

Authors:
Fred G. Gustavson
Affiliations:
IBM T.J. Watson Research Center, Emeritus, USA; Umeå University, Sweden
Venue:
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Year:
2011

Abstract

We briefly describe cache blocking for dense linear algebra algorithms on computer architectures since about 1985. Before then, machines had uniform memory architectures; the Cray-1 was the last holdout. We cover the where, when, what, how, and why of cache blocking. Almost all computer manufacturers have recently (about seven years ago) dramatically changed their architectures to produce multicore (MC) processors. We show that the arrangement in memory of the submatrices Aij of A is a critical factor for obtaining high performance. From a practical point of view, this work is important because it will allow existing codes that use LAPACK and ScaLAPACK to remain usable with new versions of LAPACK and ScaLAPACK.
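The abstract's point about how the submatrices Aij are arranged in memory can be illustrated with a small sketch. It is not taken from the paper: the function name `to_block_layout`, the block order `nb`, and the choice of ordering blocks by block column are all assumptions made for illustration. The sketch copies a column-major matrix, the layout standard LAPACK routines expect, into a layout where each nb x nb submatrix occupies consecutive memory locations, which is one common way to make a block fit cleanly in cache.

```python
def to_block_layout(a, n, nb):
    """Copy an n x n column-major matrix into a square-block layout.

    a  : flat list of length n*n; element (i, j) is stored at a[i + j*n]
    nb : block order (hypothetical parameter); assumed to divide n so
         that the sketch stays short
    Returns a flat list in which each nb x nb submatrix A_ij occupies
    nb*nb consecutive entries. Blocks are ordered by block column, and
    elements inside a block are stored column-major.
    """
    if n % nb != 0:
        raise ValueError("this sketch assumes nb divides n")
    mb = n // nb                      # number of blocks per row and per column
    out = [0.0] * (n * n)             # every entry is overwritten below
    for bj in range(mb):              # block column index
        for bi in range(mb):          # block row index
            base = (bi + bj * mb) * nb * nb   # start of block (bi, bj)
            for jj in range(nb):      # column inside the block
                for ii in range(nb):  # row inside the block
                    i = bi * nb + ii  # global row
                    j = bj * nb + jj  # global column
                    out[base + ii + jj * nb] = a[i + j * n]
    return out


# Usage: a 4 x 4 matrix whose stored value equals its column-major index.
a = list(range(16))
blocked = to_block_layout(a, 4, 2)
print(blocked[:4])   # entries of block (0, 0): [0, 1, 4, 5]
```

In the column-major input, the four entries of block (0, 0) sit at offsets 0, 1, 4, and 5, so they are separated by a stride of n; after the copy they are adjacent. For large n that stride is what can spread a single submatrix over many cache lines and memory pages.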