High-Performance Matrix Multiplication Algorithms for Architectures withHierarchical Memories

Authors:
John A. Gunnels;Greg M. Henry;Robert A. van de Geijn
Affiliations:
-;-;-
Venue:
High-Performance Matrix Multiplication Algorithms for Architectures withHierarchical Memories
Year:
2001

Citing 0
Cited 2

64-bit floating-point FPGA matrix multiplication

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Speeding up Nek5000 with autotuning and specialization

Proceedings of the 24th ACM International Conference on Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

During the last half-decade, a number of research efforts have centered around the development of software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of memory hierarchy. We take a different approach. Using a simple model of hierarchical memories we employ mathematics to determine a locally-optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally-optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that ideally one should pursue a heuristic that allows the blocking strategy to be determined dynamically at run-time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner-kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results, for the Intel Pentium (R) III processor, support the theoretical insights.