A family of high-performance matrix multiplication algorithms

Authors:
John A. Gunnels; Fred G. Gustavson; Greg M. Henry; Robert A. van de Geijn
Affiliations:
IBM T.J. Watson Research Center; IBM T.J. Watson Research Center; Intel Corporation; The University of Texas, Austin
Venue:
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Year:
2004

Abstract

We describe a model of hierarchical memories and use it to determine an optimal strategy for blocking the operand matrices of matrix multiplication. The model extends an earlier, related model by three of the authors. As before, the model predicts the form of current, state-of-the-art L1 kernels. Additionally, it shows that current L1 kernels can continue to deliver their high performance on operand matrices that are as large as the L2 cache. For a hierarchical memory with L memory levels (main memory and L-1 caches), our model reduces the number of potential matrix multiply algorithms from 6^L to four. We use the shape of the matrix input operands to select one of our four algorithms. In the earlier model the corresponding count was 2^L, and that model was independent of the matrix operand shapes. Because of space limitations, we do not include performance results.
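
To make the idea of blocking and the size of the algorithm space more concrete, the following is a minimal Python sketch. It is not the authors' algorithm or kernel. It assumes that the factor of 6 per memory level comes from the 3! = 6 possible nestings of the three block loops (here named i, j and p), which gives 6^L candidates for L levels. The function names, the `order` argument and the block size are hypothetical and chosen only for illustration.

```python
from itertools import permutations, product


def blocked_matmul(A, B, order="ijp", block=2):
    """Multiply A (m x k) by B (k x n) using one level of blocking.

    `order` is a permutation of "ijp" giving the nesting of the three
    block loops, outermost first. `block` is an illustrative block size,
    not a tuned value.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    extent = {"i": m, "j": n, "p": k}
    # Block start offsets for each loop dimension, in the requested nesting order.
    ranges = [range(0, extent[d], block) for d in order]
    # product() varies the last range fastest, so the first entry of
    # `order` acts as the outermost block loop.
    for starts in product(*ranges):
        s = dict(zip(order, starts))
        # Inner "kernel": update one block of C from one block of A and of B.
        for i in range(s["i"], min(s["i"] + block, m)):
            for j in range(s["j"], min(s["j"] + block, n)):
                acc = C[i][j]
                for p in range(s["p"], min(s["p"] + block, k)):
                    acc += A[i][p] * B[p][j]
                C[i][j] = acc
    return C


def loop_orderings(levels):
    """Count block-loop nestings: 3! = 6 per level, hence 6**levels."""
    return len(list(permutations("ijp"))) ** levels


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
# Every nesting computes the same product; they differ only in the
# order in which blocks of A, B and C are visited.
results = {"".join(o): blocked_matmul(A, B, "".join(o)) for o in permutations("ijp")}
```

All six nestings return the same matrix, so choosing among them is purely a question of data movement between memory levels. Under the assumption above, a three-level hierarchy would already have 6^3 = 216 candidate nestings, which suggests why a model that narrows the choice to four is useful.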