We describe a model of hierarchical memories and use it to determine an optimal strategy for blocking the operand matrices of matrix multiplication. The model extends an earlier, related model by three of the authors. As before, the model predicts the form of current, state-of-the-art L1 kernels. Additionally, it shows that current L1 kernels can sustain their high performance on operand matrices that are as large as the L2 cache. For a hierarchical memory with L memory levels (main memory and L-1 caches), our model reduces the number of potential matrix multiply algorithms from 6^L to four. We use the shape of the matrix input operands to select one of our four algorithms. In the earlier model, the corresponding count was 2^L rather than four, and the model was independent of the matrix operand shapes. Because of space limitations, we do not include performance results.
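To make the 6^L figure concrete, here is a minimal sketch, not taken from the paper. It assumes that the count arises from choosing one of the 3! = 6 orderings of the three matrix multiply loops independently at each memory level. The function names `count_loop_nest_algorithms` and `blocked_matmul` and the single `block` parameter are hypothetical and used only for illustration. The blocked routine shows one such ordering for a single level of blocking; it is not one of the four algorithms the model selects.

```python
from itertools import permutations


def count_loop_nest_algorithms(levels):
    """Count candidate algorithms, assuming (for illustration only) one
    free choice among the 3! = 6 orderings of the i, j, p loops at each
    of `levels` memory levels."""
    orderings_per_level = len(list(permutations("ijp")))  # 6
    return orderings_per_level ** levels


def blocked_matmul(A, B, block):
    """Compute C = A * B with one level of square blocking.

    A is m x k and B is k x n, both given as lists of rows. The block
    loops run in (i, p, j) order, which is just one of the six possible
    orderings at this level.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):            # block row of A and C
        for p0 in range(0, k, block):        # block along the shared dimension
            for j0 in range(0, n, block):    # block column of B and C
                # Multiply one block of A by one block of B and
                # accumulate the product into the matching block of C.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            C[i][j] += a * B[p][j]
    return C
```

Under this assumption, a three-level hierarchy (main memory plus two caches) already gives 6^3 = 216 candidate loop nests, which suggests why narrowing the choice to four, selected by operand shape, is a substantial reduction.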