Matrix multiplication via arithmetic progressions
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Communication complexity of PRAMs
Theoretical Computer Science - Special issue: Fifteenth international colloquium on automata, languages and programming, Tampere, Finland, July 1988
A bridging model for parallel computation
Communications of the ACM
On the additive complexity of 2 × 2 matrix multiplication
Information Processing Letters
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
ScaLAPACK user's guide
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers
SAC '95 Proceedings of the 1995 ACM symposium on Applied computing
Accuracy and Stability of Numerical Algorithms
A cellular computer to implement the Kalman filter algorithm
Communication lower bounds for distributed-memory matrix multiplication
Journal of Parallel and Distributed Computing
Concurrency and Computation: Practice & Experience
Group-theoretic Algorithms for Matrix Multiplication
FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science
Fast matrix multiplication is stable
Numerische Mathematik
Combining building blocks for parallel multi-level matrix multiplication
Parallel Computing
Graph expansion and communication costs of fast matrix multiplication: regular submission
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
The Future of Computing Performance: Game Over or Next Level?
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction
IPPS '93 Proceedings of the 1993 Seventh International Parallel Processing Symposium
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Communication-avoiding parallel Strassen: implementation and performance
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Graph expansion and communication costs of fast matrix multiplication
Journal of the ACM (JACM)
Graph expansion analysis for communication costs of fast rectangular matrix multiplication
MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
Work-efficient matrix inversion in polylogarithmic time
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Communication optimal parallel multiplication of sparse random matrices
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Communication costs of Strassen's matrix multiplication
Communications of the ACM
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds and is therefore communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
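For context, the sequential recursion that underlies these parallel algorithms can be sketched as follows. This is a minimal illustration of Strassen's classical scheme — seven recursive half-size products instead of eight, giving O(n^log2(7)) ≈ O(n^2.81) arithmetic — and is not the paper's communication-avoiding parallel implementation; the function names and the power-of-two size restriction are simplifying assumptions for this sketch.

```python
def _add(A, B):
    # Elementwise sum of two equal-size matrices (lists of lists).
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _sub(A, B):
    # Elementwise difference of two equal-size matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply two n x n matrices, n a power of two (illustrative sketch)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each matrix into four h x h quadrants.
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]
    # Strassen's seven recursive products.
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    # Recombine the products into the four quadrants of C = A * B.
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In the parallel setting, the communication cost comes from distributing these seven subproblems and their operands across processors at each level of the recursion, which is exactly the traffic the paper's algorithm provably minimizes.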