How can we speed up matrix multiplication?
SIAM Review
Matrix multiplication via arithmetic progressions
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Exploiting fast matrix multiplication within the level 3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Using Strassen's algorithm to accelerate the solution of linear systems
The Journal of Supercomputing
Stability of block algorithms with fast level-3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Implementation of Strassen's algorithm for matrix multiplication
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms
Accuracy and Stability of Numerical Algorithms
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Algorithms for matrix multiplication
Algorithms for matrix multiplication
The aggregation and cancellation techniques as a practical tool for faster matrix multiplication
Theoretical Computer Science - Algebraic and numerical algorithm
Optimizing Sorting with Genetic Algorithms
Proceedings of the international symposium on Code generation and optimization
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Using recursion to boost ATLAS's performance
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Adaptive Winograd's matrix multiplications
ACM Transactions on Mathematical Software (TOMS)
Matrices as arrows!: a biproduct approach to typed linear algebra
MPC'10 Proceedings of the 10th international conference on Mathematics of program construction
Optimization techniques for small matrix multiplication
Theoretical Computer Science
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Typing linear algebra: A biproduct-oriented approach
Science of Computer Programming
Hi-index | 0.00 |
Strassen's matrix multiplication (MM) has benefits with respect to any (highly tuned) implementations of MM because Strassen's reduces the total number of operations. Strassen achieved this operation reduction by replacing computationally expensive MMs with matrix additions (MAs). For architectures with simple memory hierarchies, having fewer operations directly translates into an efficient utilization of the CPU and, thus, faster execution. However, for modern architectures with complex memory hierarchies, the operations introduced by the MAs have a limited in-cache data reuse and thus poor memory-hierarchy utilization, thereby overshadowing the (improved) CPU utilization, and making Strassen's algorithm (largely) useless on its own. In this paper, we investigate the interaction between Strassen's effective performance and the memory-hierarchy organization. We show how to exploit Strassen's full potential across different architectures. We present an easy-to-use adaptive algorithm that combines a novel implementation of Strassen's idea with the MM from automatically tuned linear algebra software (ATLAS) or GotoBLAS. An additional advantage of our algorithm is that it applies to any size and shape matrices and works equally well with row or column major layout. Our implementation consists of introducing a final step in the ATLAS/GotoBLAS-installation process that estimates whether or not we can achieve any additional speedup using our Strassen's adaptation algorithm. Then we install our codes, validate our estimates, and determine the specific performance. We show that, by the right combination of Strassen's with ATLAS/GotoBLAS, our approach achieves up to 30%/22% speed-up versus ATLAS/GotoBLAS alone on modern high-performance single processors. We consider and present the complexity and the numerical analysis of our algorithm, and, finally, we show performance for 17 (uniprocessor) systems.