A Family of High-Performance Matrix Multiplication Algorithms

Authors:
John A. Gunnels;Greg M. Henry;Robert A. van de Geijn
Venue:
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Year:
2001

Citing 10
Cited 19

A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Formal Methods for High-Performance Linear Algebra Libraries

Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
BLAS Based on Block Data Structures
A Flexible Class of Parallel Matrix Multiplication Algorithms

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark

FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Formal derivation of algorithms: The triangular sylvester equation

ACM Transactions on Mathematical Software (TOMS)
Architecture of an automatically tuned linear algebra library

Parallel Computing
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
The science of deriving dense linear algebra algorithms

ACM Transactions on Mathematical Software (TOMS)
High performance dense linear algebra on a spatially distributed processor

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
Updating an LU Factorization with Pivoting

ACM Transactions on Mathematical Software (TOMS)
A performance comparison of matrix solvers on Compaq Alpha, Intel Itanium, and Intel Itanium II processors

ICCSA'03 Proceedings of the 2003 international conference on Computational science and its applications: PartI
Remote parallel model reduction of linear time-invariant systems made easy

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
A family of high-performance matrix multiplication algorithms

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Performance of linear algebra code: Intel Xeon EM64T and Itanium II case examples

ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part IV
SAR image reconstruction and autofocus by compressed sensing

Digital Signal Processing
Toward scalable matrix multiply on multithreaded architectures

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
High-performance matrix multiply on a massively multithreaded Fiteng1000 processor

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II

Abstract

Over the last half-decade, a number of research efforts have focused on software that generates automatically tuned matrix multiplication kernels, among them the PHiPAC project and the ATLAS project. The software end-products of both projects use brute-force search over a parameter space to find blockings that accommodate multiple levels of the memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we employ mathematics to determine a locally optimal strategy for blocking matrices. The theoretical results show that different strategies are locally optimal depending on the shape of the matrices involved. They also show that, rather than fixing a blocking strategy at library generation time, one should ideally pursue a heuristic that determines the blocking strategy dynamically at run time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner kernel for a small matrix multiplication, the approach yields performance superior to that of methods that automatically tune such kernels. Preliminary results for the Intel Pentium (R) III processor support the theoretical insights.
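
The idea of choosing a blocking at run time from the operand shapes can be illustrated with a small sketch. The abstract does not give the paper's memory model or its actual heuristic, so everything below is an assumption made for illustration: the function names `choose_blocking`, `inner_kernel`, and `blocked_matmul`, the `cache_elems` budget, and the rule that stretches a block of A along one dimension when the other is short. A plain triple loop stands in for the highly optimized inner kernel.

```python
import math


def choose_blocking(m, n, k, cache_elems=1024):
    """Pick block sizes (mb, nb, kb) from the operand shapes.

    Hypothetical heuristic, not the model from the paper: keep an
    mb x kb block of A within a budget of cache_elems elements, and
    stretch that block along m or k when the other dimension is short.
    Assumes m, n, k >= 1.
    """
    side = max(1, math.isqrt(cache_elems))
    mb, nb, kb = min(m, side), min(n, side), min(k, side)
    if kb < side:
        # Short inner dimension: use a taller block of A.
        mb = min(m, cache_elems // kb)
    elif mb < side:
        # Few rows: use a wider block of A.
        kb = min(k, cache_elems // mb)
    return mb, nb, kb


def inner_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Stand-in for an optimized kernel: C[i0:i1, j0:j1] += A_blk * B_blk."""
    for i in range(i0, i1):
        for p in range(p0, p1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def blocked_matmul(A, B, cache_elems=1024):
    """Compute C = A * B for lists of lists, blocking chosen at run time."""
    m = len(A)
    k = len(B)
    n = len(B[0]) if k else 0
    C = [[0] * n for _ in range(m)]
    if m == 0 or n == 0 or k == 0:
        return C
    mb, nb, kb = choose_blocking(m, n, k, cache_elems)
    for i0 in range(0, m, mb):
        for p0 in range(0, k, kb):
            for j0 in range(0, n, nb):
                inner_kernel(A, B, C,
                             i0, min(i0 + mb, m),
                             j0, min(j0 + nb, n),
                             p0, min(p0 + kb, k))
    return C
```

With the default budget, a square problem gets square blocks, while a problem with a short inner dimension, such as m = 70, n = 40, k = 5, gets a block of A that spans all 70 rows. The point of the sketch is only that the blocking is a function of the shapes passed in at call time, not a constant fixed when the library is built.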