Matrix multiplication via arithmetic progressions
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
A model for hierarchical memory
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The analysis of multidimensional searching in quad-trees
SODA '91 Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Advanced compiler design and implementation
Advanced compiler design and implementation
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
ACM Transactions on Mathematical Software (TOMS)
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues
ACM Transactions on Mathematical Software (TOMS)
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms
Accuracy and Stability of Numerical Algorithms
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Improving cache Performance Through Tiling and Data Alignment
IRREGULAR '97 Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies
ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming,
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
An Approach towards an Analytical Characterization of Locality and its Portability
IWIA '01 Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01)
The Fastest Fourier Transform in the West
The Fastest Fourier Transform in the West
Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis)
Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis)
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Hi-index | 0.00 |
The practical portability of a simple version of matrix multiplication is demonstrated.The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific memory system organization for any particular machine.B y both simulations and execution on a number of platforms, we show that memory hierarchies portability does not sacrifice floating point performance; indeed, it is always a significant fraction of peak and, at least on one machine, is higher than the tuned routines by both ATLAS and vendor. The results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas.This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico