A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development.
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. SAC '95: Proceedings of the 1995 ACM Symposium on Applied Computing.
Accuracy and Stability of Numerical Algorithms.
FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
A cellular computer to implement the Kalman filter algorithm.
Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing.
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms. Proceedings of the 2006 Workshop on Memory System Performance and Correctness.
Numerische Mathematik
Exascale computing technology challenges. VECPAR'10: Proceedings of the 9th International Conference on High Performance Computing for Computational Science.
Graph expansion and communication costs of fast matrix multiplication: regular submission. Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Improving communication performance in dense linear algebra via topology aware collectives. Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Communication-optimal parallel algorithm for Strassen's matrix multiplication. Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Graph expansion analysis for communication costs of fast rectangular matrix multiplication. MedAlg'12: Proceedings of the First Mediterranean Conference on Design and Analysis of Algorithms.
Communication costs of Strassen's matrix multiplication
Communications of the ACM
Matrix multiplication is a fundamental kernel of many high-performance and scientific computing applications. Most parallel implementations use the classical O(n³) algorithm, even though algorithms with lower arithmetic complexity exist. We recently presented a new Communication-Avoiding Parallel Strassen algorithm (CAPS), based on Strassen's fast matrix multiplication, that minimizes communication (SPAA '12). It communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the theoretical lower bounds. In this paper we show that CAPS is also faster in practice. We benchmark it against previous algorithms on Hopper (Cray XE6), Intrepid (IBM BG/P), and Franklin (Cray XT4), demonstrating significant speedups both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of CAPS and predict its performance on future exascale platforms.
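The abstract contrasts classical O(n³) multiplication with Strassen's fast algorithm, on which CAPS is built. As a point of reference, here is a minimal sequential Python sketch of Strassen's recursion (illustrative only; this is not the distributed CAPS algorithm, and it assumes n-by-n inputs with n a power of two). Seven recursive products M1..M7 replace the eight block products of the classical method, giving O(n^log2(7)) ≈ O(n^2.81) arithmetic complexity.

```python
def add(A, B):
    # Elementwise sum of two equally sized matrices (lists of lists).
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    # Elementwise difference of two equally sized matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    # Partition A into four n/2-by-n/2 quadrants A11, A12, A21, A22.
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    # Multiply square matrices A and B; n must be a power of two.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Strassen's seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the quadrants of C = A * B.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

CAPS parallelizes this recursion across distributed memory so that the communication volume, not just the arithmetic, matches the lower bound; the sketch above shows only the arithmetic structure being parallelized.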