Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

Authors:
Gianfranco Bilardi;Paolo D'Alberto;Alexandru Nicolau
Affiliations:
-;-;-
Venue:
WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Year:
2001

Citing 28
Cited 3

Matrix multiplication via arithmetic progressions

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
A model for hierarchical memory

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The analysis of multidimensional searching in quad-trees

SODA '91 Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting

SIAM Journal on Matrix Analysis and Applications
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Advanced compiler design and implementation

Advanced compiler design and implementation
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues

ACM Transactions on Mathematical Software (TOMS)
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Improving cache Performance Through Tiling and Data Alignment

IRREGULAR '97 Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies

ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming,
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game

STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
An Approach towards an Analytical Characterization of Locality and its Portability

IWIA '01 Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01)
The Fastest Fourier Transform in the West

The Fastest Fourier Transform in the West
Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis)

Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis)

Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

The practical portability of a simple version of matrix multiplication is demonstrated.The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific memory system organization for any particular machine.B y both simulations and execution on a number of platforms, we show that memory hierarchies portability does not sacrifice floating point performance; indeed, it is always a significant fraction of peak and, at least on one machine, is higher than the tuned routines by both ATLAS and vendor. The results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas.This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico