A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development.
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. SAC '95: Proceedings of the 1995 ACM Symposium on Applied Computing.
Accuracy and Stability of Numerical Algorithms.
FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
A cellular computer to implement the Kalman filter algorithm.
Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing.
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms. Proceedings of the 2006 Workshop on Memory System Performance and Correctness.
Numerische Mathematik
Exascale computing technology challenges. VECPAR'10: Proceedings of the 9th International Conference on High Performance Computing for Computational Science.
Graph expansion and communication costs of fast matrix multiplication: regular submission. Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Improving communication performance in dense linear algebra via topology aware collectives. Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Communication-optimal parallel algorithm for Strassen's matrix multiplication. Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures.
Graph expansion analysis for communication costs of fast rectangular matrix multiplication. MedAlg'12: Proceedings of the First Mediterranean Conference on Design and Analysis of Algorithms.
Communication costs of Strassen's matrix multiplication
Communications of the ACM
Matrix multiplication is a fundamental kernel of many high-performance and scientific computing applications. Most parallel implementations use the classical O(n³) algorithm, even though algorithms with lower arithmetic complexity exist. We recently presented a new Communication-Avoiding Parallel Strassen algorithm (CAPS), based on Strassen's fast matrix multiplication, that minimizes communication (SPAA '12). It communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the theoretical lower bounds. In this paper we show that CAPS is also faster in practice. We benchmark it against previous algorithms on Hopper (Cray XE6), Intrepid (IBM BG/P), and Franklin (Cray XT4), demonstrating significant speedups both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of CAPS and predict its performance on future exascale platforms.
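The abstract contrasts classical O(n³) multiplication with Strassen's fast algorithm, on which CAPS is built. As a point of reference, here is a minimal sequential Python sketch of Strassen's recursion (illustrative only; this is not the distributed CAPS algorithm, and it assumes n-by-n inputs with n a power of two). Seven recursive products M1..M7 replace the eight block products of the classical method, giving O(n^log2(7)) ≈ O(n^2.81) arithmetic complexity.

```python
def add(A, B):
    # Elementwise sum of two equally sized matrices (lists of lists).
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    # Elementwise difference of two equally sized matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    # Partition A into four n/2-by-n/2 quadrants A11, A12, A21, A22.
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    # Multiply square matrices A and B; n must be a power of two.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Strassen's seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the quadrants of C = A * B.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

CAPS parallelizes this recursion across distributed memory so that the communication volume, not just the arithmetic, matches the lower bound; the sketch above shows only the arithmetic structure being parallelized.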